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Abstract 



The longest common subsequence (LCS) problem is a classical problem in 
computer science. The semi-local LCS problem is a generalisation of the 
LCS problem, arising naturally in the context of string comparison. Apart 
from playing an important role in string algorithms, this problem turns out 
to have surprising connections with computational geometry, algebra, graph 
theory, as well as applications in computational molecular biology. Our work 
gives a survey of recent results that expose these connections. We conclude 
that semi-local string comparison is a fascinating problem and a powerful 
algorithmic technique, which unifies and improves on a number of previous 
approaches. 
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Chapter 1 

Introduction 



1.1 Overview 

The longest common subsequence (LCS) problem is a classical problem in 
computer science. Given two strings a, b of lengths m, n respectively, the 
LCS problem asks for the length of the longest string that is a subsequence 
of both a and b. This length is called the strings' LCS score. We refer the 
reader to monographs [42, 63] for the background and further references. 

The semi-local LCS problem is a generalisation of the LCS problem, 
arising naturally in the context of string comparison. Given two strings a, b 
as before, the semi-local LCS problem asks for the LCS score of each string 
against all substrings of the other string, and of all prefixes of each string 
against all suffixes of the other string. In this work, we survey a number of 
algorithmic techniques related to the semi-local LCS problem, and present 
some algorithmic applications of these techniques. 

The rest of this chapter contains the necessary preliminaries. In Sec- 
tion 1.2, we establish the basic terminology and notation of points and 
matrices. In Section 1.3, we give the key definitions of distribution and 
density matrices. In Section 1.4, we introduce our main algorithmic tool: a 
special class of integer matrices, called simple unit-Monge matrices, which 
are obtained as distribution matrices of permutation matrices, and can be 
represented implicitly by a dominance counting data structure, such as a 
range tree. 

In Chapter 2, we describe our main algorithmic techniques. In Sec- 
tion 2.1, we introduce matrix distance multiplication, and study its alge- 
braic properties in the classes of Monge and simple unit-Monge matrices. 
In Section 2.2, we present an algebraic formalism for distance multiplica- 
tion of simple unit-Monge matrices, called seaweed braids. In Section 2.3, 
we give an efficient algorithm for distance multiplication of implicit simple 
unit-Monge matrices (or, equivalently, seaweed braids). In Section 2.4, we 
describe an application of this algorithm to deciding comparability in the 
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Bruhat partial order on permutations. 

In Chapter 3, we develop an application of our techniques to the semi- 
local LCS problem. In Section 3.1, we formally define the semi-local LCS 
problem and related concepts. In Section 3.2, we introduce alignment dags 
and highest-score matrices. Exploiting the algebraic framework of unit- 
Monge matrices, in Section 3.3 we obtain an efficient algorithm for highest- 
score matrix composition. 

In Chapter 4, we introduce a conceptually simple algorithm for the semi- 
local LCS problem, called the seaweed algorithm, and show a number of its 
applications. In Section 4.1, we describe the seaweed algorithm itself, and in 
Section 4.2, its slightly faster version, obtained by micro-block precomputa- 
tion. In Sections 4.3 and 4.4, we apply the seaweed algorithm to solving the 
incremental and the common-substring versions of the LCS and semi-local 
LCS problems. In Section 4.5 we give an algorithm for the cyclic LCS prob- 
lem, and in Section 4.6 for the longest repeating subsequence problem. All 
our algorithms match, improve on, and/or generalise existing algorithms for 
these problems. 

In Chapter 5, we generalise our techniques from LCS scores to arbitrary 
rational-weighted alignment scores and edit distances. In Section 5.1, we in- 
troduce the main concepts of weighted alignment, and describe an approach 
to rational-weighted alignment via the blow-up technique. In Section 5.2, 
we use the framework of weighted alignment to obtain algorithms for several 
versions of the approximate pattern matching problem. 

In Chapter 6, we describe an extension of the seaweed algorithm that 
allows efficient semi-local comparison of two input strings, one of which is 
periodic. In Section 6.1, we describe the periodic seaweed algorithm itself. 
By application of the periodic seaweed algorithm, in Section 6.2 we obtain 
new algorithms for the tandem LCS problem and the tandem cyclic align- 
ment problem, improving on existing algorithms in running time. 

In Chapter 7, we consider the semi-local LCS problem restricted to per- 
mutation strings of length n. In particular, Section 7.1 gives an algorithm 
for the semi-local LCS problem on permutation strings. By direct appli- 
cation of this algorithm, in Section 7.2 we obtain an improved algorithm 
for the cyclic LCS problem on permutations. Further applications include 
improved algorithms for the longest pattern-avoiding subsequence problem, 
given in Section 7.3, and for the longest /c-increasing and /c-modal sub- 
sequence problems, given in Section 7.4. In Section 7.5, we consider the 
maximum clique problem in a circle graph represented by an interval model 
of size n. By application our semi-local LCS algorithm on permutations, we 
obtain new algorithms for this problem, both for general and sparse circle 
graphs, achieving a substantial improvement on existing algorithms in run- 
ning time. In Section 7.6, we describe an application of these algorithms to 
the problem of finding exact and approximate commonly structured patterns 
in linear graphs. 
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In Chapter 8, we apply the semi-local LCS problem to compressed string 
comparison. Our goal is to obtain efficient algorithms that work on com- 
pressed strings without first decompressing them. In Section 8.1, we in- 
troduce the grammar compression (GC) framework, that generalises the 
classical LZ78 and LZW methods. In Section 8.2, we describe a folklore 
algorithm for global subsequence recognition on GC-strings. In Section 8.3, 
we give an efficient algorithm for the three-way semi-local LCS problem on 
GC-strings. By application of this algorithm, in Section 8.4 we obtain an 
algorithm for local subsequence recognition in GC-strings, and in Section 8.5 
an algorithm for threshold approximate matching in GC-strings; both these 
algorithms improve on the existing ones in running time. 

In Chapter 9, we consider applications of our techniques that aim to 
reach beyond semi-local string comparison, with the ultimate goal of ef- 
ficient fully- local comparison. In Section 9.1, we introduce the window- 
substring and window-window LCS problems, and give an algorithm for 
these problems. This algorithm provides a refinement for the standard dot 
plot method, by allowing efficient window- window string comparison based 
on the LCS score, rather than the less sensitive Hamming score. In Sec- 
tion 9.2, we introduce the quasi-local LCS problem, which generalises the 
semi-local, window-substring and window-window LCS problems, and give 
an efficient algorithm for this problem. By application of this algorithm, in 
Section 9.3 we obtain an algorithm for sparse spliced alignment under an ar- 
bitrary rational edit distance metric, which improves on existing algorithms 
for this problem. 

Some results presented in this work appeared incrementally in the au- 
thor's publications [121, 122, 124, 123, 125, 86, 126]. The aim of this work 
is to consolidate these results, unifying the terminology and notation. How- 
ever, a number of results have not been published before, and are original 
to this work. 

Summarising the presented results, we conclude that semi-local string 
comparison turns out to be a useful algorithmic plug-in, which unifies, and 
often improves on, a number of previous approaches to various substring- 
and subsequence-related problems. 

1.2 Points and matrices 

For indices, we will use either integers, or odd half-integers: 
{...,-2,-1,0,1,2,...} 

/ _5 _3 _1 1 3 5 \ 
V ■ ■ > 2' 2' 2'2'2'2'---J 

For ease of reading, odd half-integer variables will be indicated by hats (e.g. 
i, j). Ordinary variable names (e.g. i, j, with possible subscripts or super- 
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scripts), will normally denote integer variables, but can sometimes denote a 
variable that may be either integer, or odd half-integer. 
We denote integer and odd half-integer intervals by 

[i : j] = {i,i + 1, . . . , j - IJ} 

(i:j) = {i + l,i + l...,j-lj-l} 

Observe that in this notation, both an integer and an odd half-integer inter- 
val is defined by integer endpoints. To denote infinite intervals of integers 
and odd half-integers, we will use — oo for i and +00 for j where appropriate, 
so e.g. [—00 : +00] denotes the set of all integers, and (—00 : +00) the set 
of all odd half-integers. For finite intervals [i : j] and (i : j), we call the 
difference j — i interval length. 

When dealing with pairs of numbers, we will often use geometric lan- 
guage and call them points. We will write 

(Wo) < {k,ji) if *o < h and j < ji 
(Wo) S (Wi) if *o < h and j > j\ 

We will call these strict partial orders <C- and ^-dominance. When visualis- 
ing points, we will use the matrix indexing convention: the first coordinate 
in a pair increases downwards, and the second coordinate rightwards. Hence, 
the visual convention of the dominated point lying "below and to the left" 
of the dominating point, which is standard in computational geometry, cor- 
responds in this work to ^-dominance. 

We use standard terminology for geometric dominance and other partial 
orders. In particular, a set of elements forms a chain, if they are pairwise 
comparable, and an antichain, if they pairwise incomparable. Note that a 
<C-chain is a ^-antichain, and vice versa. An element in a partially ordered 
set is minimal (respectively, maximal), if, in terms of the partial order, it 
does not dominate (respectively, is not dominated by) any other element in 
the set. All minimal (respectively, maximal) elements in a partially ordered 
set form an antichain. 

A function of an integer argument will be called unit-monotone increas- 
ing (respectively, decreasing), if in every successive pair of values, the differ- 
ence between the successor and the predecessor is either or 1 (respectively, 
or -1). 

We will make extensive use of vectors and matrices with integer (oc- 
casionally, rational or real) elements, and with integer or odd half-integer 
indices. We regard a vector or matrix as a one- (respectively, two-) argument 
function, so we can speak e.g. about unit-monotone increasing matrices. 

We will sometimes consider matrices where one or both index range are 
non-consecutive sets of integers or odd half-integers; however, index ranges 
will always be assumed to be linearly ordered. Given two index ranges /, 
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J, it will be convenient to denote their Cartesian product by (I \ J). We 
extend this notation to Cartesian products of intervals: 

[*o : h I Jo : ji] ={[k- h] I [jo ■ ji]) 
(*o : h I jo ■ ji) = {(io ■ h) \ (jo ■ ji}) 

Given index ranges /, J, a vector over I is indexed by i G /, and a matrix 
over (I | J) is indexed by i £ I, j £ J. A vector or matrix is nonnegative, if 
all its elements are nonnegative. 

The matrices we consider can be implicit, i.e. represented by a compact 
data structure that supports access to every matrix element in a specified 
(small, but not necessarily constant) time. If the query time is not given, it 
is assumed by default to be constant. 

When considering matrices over non-consecutive index ranges, we will 
occasionally perform operations on such matrices as if they were over consec- 
utive intervals. This will have the following meaning: we remap the ranges 
to consecutive intervals preserving the linear order within each range, then 
we perform a matrix operation, and finally we remap the intervals back to 
the original ranges. 

We will use function notation for indexing matrices, e.g. A(i,j). We will 
use straightforward notation for selecting subvectors and submatrices: for 
example, given a matrix A over [0 : n | : n], we denote by A[io : i\ \ jo : ji] 
the submatrix defined by the given sub-intervals. A star * will indicate that 
for a particular index, its whole range is selected implicitly, e.g. A[* \ jo : 
ji] = A[0 : n | j : ji]. 

We will denote by A the transpose of matrix A. We will also denote by 
A R the matrix obtained from A by counterclockwise 90-degree rotation. For 
a matrix A over [0 : n | : n] (or [0 : n \ : n]), we have A T (i,j) = A(j, i), 
A R (i,j) = A(j,n — i) for all i,j. 

Given matrices A' over (/' | J') and A" over (/" | J"), where (/' | J') n 
(/" | J") = 0, their disjoint union is the matrix A'uA" over (I'll I" | J'llJ"), 
defined by 



(A'uA")(i,j) 



(A'(i,j) if i e j e J' 

A"(i,j) ]£i€l , ',j€J" 
otherwise 



The matrix disjoint union operation is associative, and can therefore be used 
for more than two matrices. 
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1.3 Distribution and density matrices 

Definition 1 Let D be a matrix over (iq : i\ \ jo : Its distribution 
matrix over [iq : i\ | jo : ji] is defined by 

d s (*,j) = £ 

«6<j:u)JeO'o:j> 

for all ie[i : h], j G [jo : ji]- □ 

Definition 2 Let vl be a matrix over [iq : i\ | jo : ji]. Its density matrix 
over (z : ii | jo : ji) is defined by 

A n (i,j) = A(i+l,j-l)-A(z-l,j-l) - 
A{i+\,j+\)+A{i-\,j+ l 2) 

for all i G (i : i x ), j G (j : ji). □ 

The definitions of distribution and density matrices extend naturally to ma- 
trices over an infinite index range, as long as the sum in Definition 1 is 
defined. 

Note that for any matrix D as above, and for all i, j, we have 

(D s ) D (i,j) = D(i,j) 

Also, for any matrix A as above, and for all i, j, we have 

(A n f(i,j) + b(j)+c( i ) = A( l J) 

where b = A(i±, *) is the bottom row vector, and c = A(*,jo) is the leftmost 
column vector of A. An important special case is when b, c are both zero 
vectors, as in the following definition. 

Definition 3 Matrix A will be called simple, if (^4 D ) S = A. □ 

Equivalently, a (finite) matrix is simple, if and only if its entries in the 
leftmost column and the bottom row are all zeros. 

The following classes of matrices play a fundamental role in optimisation 
theory (see [26] for an extensive survey), as well as in graph and string 
algorithms. 

Definition 4 Matrix A is called totally monotone, if 

A(i,j)>A(i,j')^A(i/,j)>A(i',j') 
for all i < i' , j < j'. □ 
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Definition 5 Matrix A is called a Monge matrix, if 

A(i,j) + A(i',j')<A(i,j')+A(i!,j) 

for all i < il ', j < j 1 '. Equivalently, matrix A is a Monge matrix, if A u is 
nonnegative. Matrix A is called an anti-Monge matrix, if —A is Monge. □ 

It is easy to see that Monge matrices form a subclass of totally monotone 
matrices. By Definition 5, a matrix is Monge, if and only if its density ma- 
trix is nonnegative. This condition is equivalent to the canonical structure 
theorem for Monge matrices, given by Burkard et al. [26]. 

1.4 Permutation and unit-Monge matrices 

Definition 6 A permutation (respectively, subpermutation) matrix is a zero- 
one matrix containing exactly one (respectively, at most one) nonzero in 
every row and every column. □ 

Typically, permutation and subpermutation matrices will be indexed by 
(not necessarily consecutive) odd half- integers. Given sets /, J of odd half- 
integers with |/| = |J|, a zero-one matrix P over (/ | J) is a permutation 
matrix, if and only if 

£P(*,j') = l J>(*',j) = l 

j'eJ i'ei 

for all % G I, j G J. An identity matrix is a permutation matrix Id over 
an interval range (io : i\ \ io : i\), such that Id(i,j) = 1, iff i = j. More 
generally, an offset indentity matrix is a permutation matrix Idh over an 
interval range (io : i\ \ jo : ji), where jo — io = ji — h = h, such that 
Idh(i,j) = 1, iff j — i = h. We have Ido = Id for any compatible index 
range. Clearly, an identity or offset indentity matrix can be represented 
implicitly in constant space and with constant query time. When dealing 
with identity and offset indentity matrices, we will often omit their index 
ranges, where they are clear from the context. 

When dealing with (sub)permutation matrices, we will write "nonzeros" 
for "index pairs corresponding to nonzeros" , as long as this does not lead to 
confusion. We will normally assume that a (sub)permutation matrix with n 
nonzeros is given implicitly by a compact data structure of size 0(n), that 
allows constant-time access to each nonzero both by the row and by the 
column index. 

Given a permutation matrix P over (/ | J), and a set I' C /, we will 
denote by P(I' \ •) the permutation submatrix row-induced by I', i.e. the 
permutation submatrix obtained by deleting from P all columns in / \ i 7 , 
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and then deleting from the remaining submatrix all zero rows. A column- 
induced permutation submatrix P(- \ J') is defined analogously. Both these 
operations can be implemented in linear time by a sweep of the nonzeros of 
matrix P. 

The following subclasses of Monge matrices play a crucial role in this 
work. 

Definition 7 A square matrix A is called a unit-Monge (respectively, subunit- 
Monge) matrix, if A u is a permutation (respectively, subpermutation) ma- 
trix. Matrix A is called a unit- anti- Monge (respectively, subunit-anti-Monge) 
matrix, if —A is unit-Monge (respectively, subunit-Monge) . □ 

By Definitions 5, 7, any subunit-Monge matrix is unit-Monge, and any unit- 
Monge matrix is Monge. Similar inclusions hold for (sub)unit-anti-Monge 
matrices. 

Matrices that are both simple and unit-Monge will be our main tool for 
the rest of this work. Note that such matrices are unit-monotone decreasing 
(respectively, increasing) in the first (respectively, second) index. Further- 
more, a square matrix A is simple unit-Monge, if and only if A = P s , where 
P is a permutation matrix. 

Example 1 The following matrix is simple unit-Monge: 



1 0' 

1 
1 



12 3' 

112 

1 





A permutation matrix P of size n can be regarded as an implicit rep- 
resentation of the simple unit-Monge matrix P s . Geometrically, a value 
P s (i,j) is the number of (odd half-integer) nonzeros in matrix P that are 
^-dominated by the (integer) point (i,j). An individual element P s (z,j) 
can be queried in time 0(n) by a sweep of the nonzeros of P, counting those 
that are ^-dominated by the query point (i,j). This procedure is known as 
geometric dominance counting. 

Using methods of computational geometry, matrix P can be preprocessed 
to allow element queries on P s (i,j) much more efficiently than by a direct 
linear sweep. 

Theorem 1 Given a (sub) permutation matrix P of size n, there exists a 
data structure which 

• has size O(nlogn); 

• can be built in time O (n log nj ; 

• allows to query an individual element of the simple (sub) unit- Monge 
matrix P^ in time O (log 2 n) ; □ 



10 



• 

• 

• 




• 




























• 




• 












• 

• 






• 

• 






• 



















































































Figure 1.1: A permutation matrix and the corresponding range tree 



Proof The required structure is a two-dimensional range tree [15] (see also 
[107]), built on the set of nonzeros in P. There are at most n nonzeros, hence 
the total number of nodes in the tree is O(nlogn). A dominance counting 
query on the set of nonzeros can be answered by accessing O (log 2 n) of the 
tree nodes. ■ 

Example 2 Figure 1.1 shows a 4 x 4 permutation matrix, with nonzeros 
indicated by green 1 bullets, and the corresponding range tree. □ 

The bounds given by Theorem 1 can be improved by employing more 
efficient data structures. Successive improvements in the efficiency of orthog- 
onal range counting (which includes dominance counting as a special case) 
were obtained by Chazelle [34], and by JaJa et al. [72]. The currently most 
efficient data structure, given by Chan and Patra§cu [30], has size 0(n), can 
be built in time (^(n^ogre) 1 / 2 ), and answers a dominance counting query in 
time 0( i giogn ) • However, the standard range tree data structure employed 
by Theorem 1 is simpler, requires a less powerful computation model, and 
is more likely to be practical. Therefore, we will be using Theorem 1 as 
our main technique for implicit representation of simple (sub)unit-Monge 
matrices. 

In addition to ordinary element queries described by Theorem 1, we will 
also deal with incremental queries, which are given an element of an implicit 
simple (sub)unit-Monge matrix, and return the value of an adjacent element. 
This kind of query can be answered directly from the (sub)permutation 
matrix, without any preprocessing. 

Theorem 2 Given a (sub) permutation matrix P of size n, and the value 
P s (i, j) ; i,j £ [0 : n], the values P s (i ± l,j), P^(i,j ± 1), where they exist, 
can be queried in time O(l). □ 

1 For colour illustrations, the reader is referred to the online version of this work. If the 
colour version is not available, all references to colour can be ignored. 
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Proof Let P be a permutation matrix; a generalisation to subpermutation 
matrices is straightforward. Consider a query of the type + 1, j); other 
query types are obtained by symmetry. Let j G (0 : n) be such that P(i + 
j) = 1; value j can be obtained from the permutation representation of P 
in time 0(1). We have 



P E (i + U) = P E (i,j) 



1 if 3 < 3 
otherwise 



Incremental queries described by Theorem 2 can be used to answer batch 
queries, returning a set of elements in a row, column or diagonal of an 
implicit simple (sub)unit-Monge matrix. In particular, all elements in a 
given row, column or diagonal of matrix P s can be obtained by a sequence 
of incremental queries in time 0(n), and a subset of r consecutive elements 
in time O (r + log 2 n) . 
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Chapter 2 

Matrix distance 
multiplication 



2.1 Distance multiplication monoids 

We will make extensive use of the (min, +)-semiring on integer or real num- 
bers, where the operators min and +, denoted by © and 0, play the role of 
addition and multiplication, respectively. This semiring is often called dis- 
tance (or tropical) algebra; for an extensive review of this and related topics, 
see e.g. Rote [113], Gondran and Minoux [61]. Our techniques are based on 
matrix-vector and matrix-matrix multiplication in the distance algebra. 

Definition 8 Let Abe a matrix over [io : i\ | jo : ji]. Let b, c be vectors 
over [jo : j\\ and [io : ii] respectively. The matrix-vector distance product 
A b = c is defined by 

c(i) = (A(i,j) b{j)) = min (A(i,j) + b(j)) 

for all i £ [io : □ 

Definition 9 Let A, B, C be matrices over [io : h \ jo '■ ji], [jo ■ ji \ &o : fci], 
[io '■ ii \ ko : k±] respectively. The matrix distance product A B = C is 
defined by 

C(i,k)= © {A(i,j)QB(j,k))= min (A(i,j) + B(j, k)) 

for all i 6 [io '■ h], k £ [ko ■ k±\. □ 

Like any multiplication of matrices over a semiring, matrix distance mul- 
tiplication is associative. The set of all square matrices with elements in 
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[0 : oo] over a given index range forms a monoid with respect to distance 
multiplication. The identity element in this monoid is the matrix 



Id Q (i,j) 



if i = j 
+00 otherwise 



The distance multiplication monoid also has an annihilator Oq, where & {i,j) 
+00 for all i,j. For all A, we have 

A Id Q = Id Q A = A AQO Q = O A = Q 

It is well-known that the set of all Monge matrices is closed under 
distance multiplication. It is slightly more surprising, but crucial for our 
method, that the same is also true for the set of all simple (sub)unit-Monge 
matrices. 

Theorem 3 Let A, B, C be matrices, such that A B = C . If A, B are 
Monge (respectively, simple unit-Monge, simple subunit- Monge), then C is 
also Monge (respectively, simple unit-Monge, simple subunit-Monge) . □ 

Proof First, let A, B be Monge matrices. Let i<i',k< k' . By definition 
of matrix distance multiplication, we have 

C(i,k') = mm(A(i,j*) + B(j*,k')) 

C(i',k) = mm(A(i!,j*) + B(j*,k)) 
3 

Let j, j' respectively be the values of j* on which these minima are attained. 
Suppose j < j'. We have 

C(i, k) + C{i', k') = (definition of 0) 

mm(A(i,f) + B(f,k))+mm(A(i',f) + B(j*,k')) < 

3* 3* 

(A{i,j) + B(j, k)) + (A(i', j') + B(j' , k')) = (term rearrangement) 

(A(i,j) + A(i',j')) + (B(j, k) + B(f, k')) < (A is Monge) 

(A(i,j') + A(i',j)^ + (B(j, k) + B(j' , k')) = (term rearrangement) 

(A(i,f) + B(j',k')) + (A(i',j) + B(j,k)) = (definition of j, f) 

C{i,k') + C(i',k) 

The case j > j' is treated symmetrically by the Monge property of B. Hence, 
matrix C is Monge. 

Now, let A, B be simple unit-Monge matrices. We have A = P^, B = 
Pg. It is easy to check that matrix C is simple. Let C = P^; we have to 
show that Pc is a permutation matrix. Without loss of generality, let Pa, 
Pb, Pc be over (0 : n | : n). Clearly, matrices C and Pc are both integer. 
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Furthermore, matrix C is Monge by the previous argument, and therefore 
matrix Pq is nonnegative. 
For any i G [0 : n], we have 

C(i,0) = min(Pf(i,j)+P|(j,0)) =min(Pf(i,j) + 0) =0 
C(i,n) = min(Pf(i,j) + P|0>)) = min(P|(i, j) + n - j) = n - t 

where the minimum is attained respectively at j = and j = n. Therefore, 
for all i G (0 : n), we have 

^] Pc(£, fe) = (definition of E and □) 

k 

Y^{c{i+\,k-\)-c{i-\,k-\)- 

k 

C(i + 7;, k + 7j) + C(i — |, A; + 5)) = (term cancellation) 

C(*+i,0)-C(t- i,0)-C(£+i,n) + C(?- i,n) = 
Q-Q-{n-i-\) + (n-i+\) = l 

Symmetrically, for all k G (0 : n), we have 
£Pc(?,*) = l 

Taken together, the above properties imply that matrix Pq is a permutation 
matrix. Therefore, C is a simple unit-Monge matrix. 

Finally, let A, B be simple subunit-Monge matrices. We have A = P^, 
5 = P|, where P A , P B 

are subpermutation matrices. As before, let C = P^; 
we have to show that Pc is a subpermutation matrix. Without loss of 
generality, let P A , Pb, Pc be over (0 : n' | : n"), (0 : n" | : n'"}, (0 : ri \ 
: n'"), respectively. Suppose that for some i, row Pa(i,*) contains only 
zeros. Then, it is easy to check that row Pc(i,*) also contains only zeros. 
Symmetrically, a zero column Ps(*, k) results in a zero column Pc(*,k). 
After deleting all zero rows from P4, all zero columns from Pg, and all the 
corresponding zero rows and columns from Pc, the equality P^ Pg = Pq 
still holds; however, matrix Pq is not necessarily a permutation matrix, and 
may not even be square. 

Without loss of generality, we may now assume that n' < n", n'" < n", 
and that subpermutation matrix Pa (respectively, Pb) does not have any 
zero rows (respectively, zero columns). Therefore, matrices Pa, Pb have 
exactly n' and n'" nonzeros, respectively. We now extend matrix Pa to a 
permutation matrix Pa over (n' — n" : n' \ : n"), by adding n" — n' rows 
with appropriately placed nonzeros at the top of the matrix. Similarly, we 
extend matrix Pg to a permutation matrix Pg over (0 : n" \ : n"), by 
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adding n'" — n' columns with appropriately placed nonzeros at the right- 
hand side of the matrix. Let Pa □ Pb = Pc, where Pq is a permutation 
matrix over (n' — n" : n' | : n"). We have 

P c = P A HPb = Pc(0 : n' I : ri") 

Hence, matrix Pq is a subpermutation matrix, and the original matrix C is 
a simple subunit-Monge matrix. ■ 

The distance multiplication identity matrix Id® can be formally considered 
a Monge matrix (since all indeterminate expressions of the form +00 — 00 
in its density matrix can be considered nonnegative) . Therefore, the set of 
all square Monge matrices forms a submonoid in the distance multiplication 
monoid of general matrices. 

The set of all simple unit-Monge matrices over a given index range also 
forms a monoid with respect to distance multiplication. The identity element 
in this monoid is the distribution matrix of the identity permutation matrix 
Id S : 



i if i < j 
otherwise 



The simple unit-Monge matrix monoid also has an annihilator (Pi^) 2 (recall 
Id R is the matrix obtained by 90-degree rotation of Id). For all P, we have 

P s © /d s = Pi s P s = P s 

P s (Id R f = (Id R f 0P S = (Id R f 

Theorem 3 gives us the basis for performing distance multiplication of 
simple (sub) unit-Monge matrices implicitly, by taking the density (sub)permutation 
matrices as input, and producing a density (sub) permutation matrix as out- 
put. It will be convenient to introduce special notation for implicit distance 
multiplication of this kind. 

Definition 10 Let P be a (sub)permutation matrix. Let b, c be vectors. 
The implicit matrix-vector distance product P □ b = c is defined by P s 06 = 
c. □ 

Definition 11 Let Pa, Pb, Pc be (sub) permutation matrices. The implicit 
matrix distance product Pa H Pb = Pc is defined by Pg = Pq . □ 

The set of all permutation matrices over (0 : n \ : n) is therefore a monoid 
with respect to implicit distance multiplication □, with identity element Id 
and annihilator Id R . 

Example 3 In Figure 2.1, Subfigure 2.1a shows a triple of 6 x 6 permutation 
matrices Pa, Pb, Pc, with nonzeros indicated by green bullets, such that 
P A □ Pb = Pc- □ 
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(b) As seaweed braids 



Figure 2.1: Implicit matrix distance product Pa E Pb = -Pc 

2.2 Seaweed braids 

Further understanding of the distance multiplication monoid of simple unit- 
Monge matrices can be gained by the following construction. Given a per- 
mutation matrix P over (J | J), we represent the indices in sets I and J 
by nodes on two parallel lines in the Euclidean plane, respecting the order 
of indices within each set. We represent every nonzero P(i,j) = 1 by con- 
necting node i S I with node j S J by a continuous monotone line called a 
seaweed. We call the resulting configuration a seaweed braid. Unless P is the 
identity matrix Id, some of the seaweeds in the seaweed braid will have to 
cross. However, the seaweeds are drawn so that any "unnecessary" crossings 
are avoided; in other words, a given pair of seaweeds can only cross at most 
once. 

Consider the implicit distance product Pa □ Pb = Pci where Pa, Pb, 
Pc are permutation matrices over (/ | J), (J \ K) and (I \ K), respectively. 
We represent the indices in sets /, J, K by nodes on three parallel lines, and 
the nonzeros of the input matrices Pa, Pb by two seaweed braids connecting 
the corresponding nodes. A seaweed braid for the output matrix Pq can be 
obtained as follows. First, we remove the nodes representing the index set 
J. At each removed node j E J, we join up the two incident seaweeds, which 
represent nonzeros Pa(i,j) = 1 and Pb(j, k) = 1. We now have a seaweed 
configuration between nodes of I and nodes of K. However, some seaweed 
pairs in this configuration may cross twice. We now "comb" the seaweeds 
by running through all their crossings, respecting the top-to-bottom partial 
order of the crossings from I to K. For each crossing, we check whether the 
two crossing seaweeds have previously crossed above the current point. If 
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this is the case, then we undo the current crossing by cutting it out of the 
configuration, and replacing it by two non-crossing seaweed pieces. After all 
the crossings have been processed, the resulting configuration is a seaweed 
braid representing the output matrix Pq. 

Example 4 In Figure 2.3, Subfigure 2.1b shows the seaweed braids repre- 
senting the implicit matrix distance product in Subfigure 2.1a. □ 

Seaweed braids can be formalised algebraically as follows. The seaweed 
monoid T n is a finitely presented monoid on n generators id, g±, <?2, ■ ■ ■ , 9n-i- 
Generator id is the identity element, which corresponds to a seaweed braid 
where all the seaweeds are parallel. Each of the remaining generators gt 
corresponds to a seaweed braid where all the seaweeds are parallel, except 
a pair of neighbouring seaweeds in positions t — \ and t + \, which do 
cross. In matrix notation, the identity generator id corresponds to the 
simple unit-Monge matrix Id^, and each generator gt corresponds to a simple 
unit-Monge matrix Pp, where an elementary transposition matrix Pt is a 
permutation matrix defined by Pt(i, j) = 1 iff i = j {t — |, t+ \} or {i, j] = 
{t — \ ,t+\}. Concatenation of words in the generators corresponds to the 
composition of seaweed braids. The presentation of monoid T n consists of 
the idempotence relations 

9t=9t *€[l:n-l] (2.1) 

and the braid relations 

gtg u = 9u9t t,u e [1 : n - l],u - 1 > 2 (2.2) 

9t9u9t = 9u9t9u t,ue [l:n-l],u-i = l (2.3) 

Intuitively, relations (2.1) express the seaweeds' double crossing property; 
relations (2.2) express the commutativity of independent seaweed crossings 
(note that pairs of crossing with \t — u\ < 1 are not independent, so their 
corresponding generators do not commute); and relations (2.3) give two 
equivalent expressions for a local crossing of three seaweeds. 

We are now able to establish a formal connection between distance mul- 
tiplication of simple unit-Monge matrices and the seaweed monoid. 

Theorem 4 The distance multiplication monoid ofnxn simple unit-Monge 
matrices is isomorphic to the seaweed monoid T n . □ 

Proof It is straightforward to check that any simple unit-Monge matrix 
can be decomposed into a distance product of matrices P t s for various 
values of t; this can be visualised as drawing a seaweed configuration for P, 
and decomposing it into individual seaweed crossings. Hence, matrices Pp 
serve as generators for the distance multiplication monoid of simple unit- 
Monge matrices. By using the defining relations of the seaweed monoid, it 
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is also straightforward to check that multiplication in both monoids agrees 
on the generators. By associativity of multiplication, this implies that mul- 
tiplication in monoids agrees on all the elements, therefore the two monoids 
are isomorphic. ■ 

The classical positive braid monoid (see e.g. [77, Section 6.5]) on genera- 
tors id, g±, <72, • • • , g-n-i, is defined by the braid relations (2.2)-(2.3) alone. 
Therefore, the seaweed monoid is isomorphic to the quotient of the positive 
braid monoid by the idempotence relations (2.1). The seaweed monoid has 
been previosly introduced as the O-Hecke monoid (of a symmetric group) 
by Fomin and Greene [54], and by Buch et al. [24]. A generalisation of the 
seaweed monoid is given by O-Hecke monoids of Coxeter groups, also known 
as Coxeter monoids, which arise naturally as subgroup monoids in groups. 
The theory of Coxeter monoids can be traced back to Bourbaki [23], and 
has been developed by Tsaranov [128] and Richardson and Springer [111]. 
The contents of this and the following sections can be regarded as the first 
step in the algorithmic study of Coxeter monoids. 

2.3 Distance multiplication algorithms 

In this section, we show that distance multiplication of Monge and simple 
unit-Monge matrices can be performed much more efficiently than a naive 
implementation of the definitions, by exploiting the special properties of the 
matrices. For simplicity, we only consider square matrices, although some 
of the results generalise to rectangular ones. 

We begin with matrix-vector distance multiplication. For generic, ex- 
plicitly represented matrices, the fastest method for matrix-vector distance 
multiplication of size n is by direct application of Definition 8 in time 0(n 2 ). 
For implicit Monge matrices, the running time can be substantially reduced 
by an application of a classical row minima searching algorithm by Aggarwal 
et al. [1] (see also [58]), often nicknamed the "SMAWK algorithm". 

Lemma 1 ([1]) Let A be an nx n implicit totally monotone matrix, where 
each element can be queried in time q. The problem of finding the (say, 
leftmost) minimum element in every row of A can be solved in time 0(qn).n 

Proof We give a sketch of the proof; for details, see [1, 58]. 

Let A' be an implicit ^ x n matrix, obtained by taking every other row 
of A. Clearly, at least § columns of A' do not contain any of its leftmost 
row minima. The key idea of the algorithm is to eliminate such columns in 
an efficient process, based on the total monotonicity property. 

Let A' be over [iq : i\ | jo : ji], i\ — io = § , ji — jo = n. The column 
elimination procedure builds a "staircase" of matrix entries that belong to 
yet uneliminated columns, but are already known not to contain a leftmost 
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i <r- i j <r- j f <r- j + 1 

while j' < j±: 
case A(i,j) < A(i,f): 

case i < %\\ % 4- i + 1 j 4- j' 

case i = i\: eliminate column j' 

/ 4-f + l 
case > 

eliminate column j 

casei = i : j «- j' f<-j' + l 

case i > io- i 4- i—1 j 4- max{/c : uneliminated and < j} 

Table 2.1: Elimination procedure of Lemma 1. 

row minimum. In every iteration, the "staircase" is either extended by 
one "step" , or a whole column is eliminated from the matrix, reducing the 
"staircase" by one "step" . The decision depends on the comparison of two 
elements A(i,j), A(i,f) at the tip of the "staircase". Following the compar- 
ison and possible column elimination, indices i, j, j' are updated to point 
to the tip of the updated (extended or reduced) "staircase" . The full elim- 
ination procedure is given in Table 2.1. By storing indices of uneliminated 
columns in an appropriate dynamic data structure, such as a doubly-linked 
list, a single iteration of this procedure can be implemented to run in time 
0{q). The whole elimination procedure runs in time 0(qn), and eliminates 
| columns. 

Let A" be the f x ^ matrix obtained from A' by deleting the eliminated 
columns. We call the algorithm recursively on A" . Given the output of 
the recursive call, which provides the leftmost row minima of A', it is now 
straightforward to fill in the leftmost minima in the remaining rows of A in 
time 0{qn). Thus, a single recursive call runs in time 0{qn). The amount 
of work halves in every level of recursion, therefore the overall running time 
is 0(qn). ■ 

Example 5 Figure 2.2 gives a snapshot of the two non-boundary cases of 
the elimination algorithm in the proof of Lemma 1. Each vertical dotted line 
represents an arbitrary number of consecutive eliminated columns. Dark- 
shaded cells represent the "staircase" of matrix entries that belong to yet 
uneliminated columns, but are known not to contain any leftmost row min- 
ima. The current elements A(i,j), A(i,f) at the tip of the "staircase" are 
shown by white circles. 

Subfigure 2.2a shows the case A(i,j) < A(i,j'), i < i±, and Subfig- 
ure 2.2b the case A(i,j) > A(i,j'), i > i\. In both cases, the light-shaded 
cells represent the new entries that become known not to contain any left- 
most row minima. In Subfigure 2.2a, these new entries extend the "staircase" 
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(a) Case A(i,j) < A(i,j'), i < h (b) Case A(i,j) > A(i,f), i > i 



Figure 2.2: A snapshot of the proofs of Lemmas 1 and 2 

by one "step". In Subfigure 2.2b, these new entries result in the elimination 
of a whole column, reducing the "staircase" by one "step" . In both cases, 
the new positions for elements A(i,j), A(i,f) are shown by black bullets. □ 

Theorem 5 Let A be an nx n implicit Monge matrix, where each element 
can be queried in time q. Let b, c be n-vectors, such that AQ b = c. Given 
vector b, vector c can be computed in time 0(qn). □ 

Proof Let A(i, j) = A(i,j) + b(j). Matrix A is an implicit Monge matrix, 
where each element can be queried in time q + 0(1). The problem of com- 
puting the product A b = c is equivalent to searching for row minima in 
matrix A, which can be solved in time 0{qn) by Lemma 1. ■ 

Let us now restrict our attention to implicit unit-Monge matrices. By 
Theorem 1, an element of such a matrix (represented by an appropriate data 
structure) can be queried in time q = 0(log 2 n). By plugging this query time 
into Theorem 5, we obtain immediately an algorithm for implicit matrix- 
vector distance multiplication, running in time 0(n log 2 n). A more careful 
analysis of the proof of Lemma 1 shows that its required matrix elements 
can be obtained, instead of the standalone element queries of Theorem 1, 
by more efficient incremental queries of Theorem 2, with q = 0(1). As a 
result, we obtain the following algorithm. 

Lemma 2 Let A be an nx n implicit (sub) unit-Monge matrix over [iq : i\ \ 
jo '■ ji]> represented by permutation matrix P = A u and vectors b = A(i\, *), 
c = A(*,jo). The problem of finding the (say, leftmost) minimum element 
in every row of A can be solved in time 0(n). □ 

Proof We follow the algorithm outlined in the proof of Lemma 1. We claim 
that under the conditions of the current lemma, every matrix element query 
in the algorithm can be performed as an incremental query of Theorem 2. 

Consider the column elimination procedure of Lemma 1. At the end of 
a loop iteration, we have just updated the values of i, j, j'. For the next 
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iteration, we need to query A(i, j), A(i, f) with the new index values. Given 
the old value of A(i, j'), these new values can be obtained by two incremental 
queries in time O(l), except where the previous iteration has gone through 
the case A(i,j) > A(i,j'), i > iq. In this case, the new value A(i,f) can 
still be obtained by an incremental query from the old A(i,j'). To query 
the new value A(i,j), we observe that the new i is the old i — 1, and the 
new j is the index of the rightmost uneliminated column to the left of the 
old j. If the new i is equal to io, querying A(i, j) is trivial. Otherwise, the 
value A(i — has already been queried at some point in the past, and 
therefore A(i,j) can be obtained from A(i — by an incremental query 
in time 0(1). 

Now consider the row fill-in procedure. Here, the elements of matrix A 
are queried consecutively either by row, or by column. Let A(i,j) be the 
current element. The next element to be queried is either A(i,j + 1), or 
A(i + 2,j + 1). In both cases, this can be achieved by an incremental query 
in time 0(1). 

Thus, every element query can be performed in time O(l). Therefore, 
the overall running time is 0{n). ■ 

Example 6 In Figure 2.2, the incremental queries within the elimination 
algorithm in the proof of Lemma 2 are shown by arrows. Note that no 
incremental query can cross a vertical dotted line, since every such line 
represents an arbitrary number of eliminated columns. □ 

Theorem 6 Let P be an n x n (sub) permutation matrix. Let b, c be n- 
vectors, such that P □ b = c. Given the nonzeros of P and the full vector b, 
vector c can be computed in time 0(n). □ 

Proof By Theorem 5 and Lemma 2. ■ 

We now consider matrix-matrix distance multiplication. For generic ma- 
trices, direct application of Definition 9 gives an algorithm for matrix dis- 
tance multiplication of size n, running in time 0(n 3 ). Slightly subcubic 
algorithms for this problem have also been obtained. The fastest currently 
known algorithm is by Chan [31] , running in 

time 0( n3( ^ n)3 ). 
For Monge matrices, distance multiplication can be easily performed in 
quadratic time. 

Theorem 7 Let A, B , C be n x n matrices, such that A is Monge, and 
A B = C . Given matrices A, B, matrix C can be computed in time and 
memory 0(n 2 ). □ 

PROOF The problem of computing the product AqB = C is equivalent to n 
instances of the matrix- vector product A b = c, where b (respectively, c) is 
a column of B (respectively, C). Every one of these instances can be solved 
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in time 0(n) by Theorem 5, so the overall running time (and therefore also 
memory) is n • 0(n) = 0(n 2 ). ■ 

When matrices are represented explicitly, the running time in Theorem 7 
is clearly optimal. However, for implicit simple unit-Monge matrices, the 
distance multiplication time can be reduced further. In [121, 124], we gave 
an algorithm running in time 0(n 15 ). In [127], we improved upon this as 
follows. 

Theorem 8 Let Pa, Pb, Pc be nx n (sub) permutation matrices, such that 
Pa H Pb = Pc- Given the nonzeros of Pa, Pb, the nonzeros of Pc can be 
computed in time 0(n log n). □ 

PROOF Let Pa, Pb, Pc be permutation matrices over (0 : n | : n). The 
algorithm is defined by recursion on n. 

Recursion base: n = 1. The computation is trivial. 

Recursive step: n > 1. Assume without loss of generality that n is even. 
Informally, the idea is to split the range of index j in the definition of matrix 
distance product (Definition 9) into two subranges of size ^. For each of 
these subranges of j, we use the sparsity of the input permutation matrix 
Pa (respectively, Pb) to partition the range of index i (respectively, k) into 
two disjoint, not necessarily contiguous, subsets of size ^. We then call the 
algorithm recursively on the two resulting half-sized subproblems, and use 
the two returned half-sized permutation matrices to reconstruct the output 
permutation matrix Pc, relying on the Monge properties of the respective 
distribution matrices. 

We now describe the recursive step in more detail. We have 

Let 

P A ,io = Pa{* I : f ) P B ,io = Pb(0 : § I *) P| to P|, to = P^ 
Pam = Pa(* I | : n) P B , hi = P B (f : n | *) P% hi P^ M = Pg w 

In the first subproblem, matrices Pa,i , Pb,Io are rectangular (respectively, 
nx ^ and § xn) subpermutation matrices, each with ^ nonzeros. Recall from 
the proof of Theorem 3 that a zero row (respectively, column) in Pa,Io, Pb,Io 
corresponds to a zero row (respectively, column) in their implicit distance 
product Pc,io- Therefore, we can delete all zero rows and columns from 
Pa,Io, Pb,Io, Pc,lo-> obtaining, after appropriate index remapping, three ^ x 3 
permutation matrices. Consequently, the first subproblem can be solved 
by first performing a linear-time index remapping (corresponding to the 
deletion of zero rows and columns from Pa,Io, Pb,Io), then making a recursive 
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call on the resulting half-sized problem, and then performing an inverse index 
remapping (corresponding to the reinsertion of the zero rows and columns 
into Pc,io)- The second subproblem can be solved analogously. 

Since the nonzeros in the two subproblems have disjoint index ranges, 
the sum Pc,i + Pc,h% is an n x n permutation matrix. We have 

P%(i,k)= mm(P${i,j) + Pf(j, A;)) = 

min( min (P%(i,j) + P]j(j,k)), min (Pj({i,j) + Pf(j, A;))) 

for all i, k G [0 : n]. The first argument in the above expression can now be 
rewritten as 

min (P%(i,j)+PE(j,k)) = 

min (P|j*, j) + P| >to 0', fc) + if W (§, fc)) = 
ie[o : f] 

min (Pf, i0 (i,j) + P|, i0 (j,fc)) +P| )W (f, fc) = 
ie[o:f] 

Pg to (i,fc) + i^ lW (0,fc) 

The second argument can be rewritten analogously, so we have 

P§(i, k) = mm(P* l0 (i, k) + Pg M (0, fc), Pg M (i, fc) + Pg to (i, n)) 

for all i, k G [0 : n] . In order to compute the nonzeros of matrix Pq efficiently, 
consider the difference of arguments in the above expression: 

S(i,k) = 

(P* lo (i,k) + P* hl (0,k)) - {PE M (i,k) + P^ l0 (i,n)) = 
{PE M (0,k) - P£ M (i,k)) - (Pg to (i,n) - PE tlo (i,k)) = 

ie{0:i),ke{0:k) ie{i:n) ,ke(k:n) 

Since Pc,io, Pc,hi are subpermutation matrices, and Pc,i + Pc,hi a permu- 
tation matrix, it follows that function 5 is unit-monotone increasing in each 
of its arguments. 

The sign of function 5 plays an important role in determining the posi- 
tions of nonzeros in Pq- Let us fix some i, k £ (0 : n), and consider the four 
values S(i ± ^, k ± \)- Three cases are possible. 

Case 8{i + ^, k + < 0. By monotonicity of 5, we have 5(z ± |, At ± ^) < 
for all four sign combinations. Therefore, 

p§(t±l,k±i) = p* l0 (i±ik±i) 



24 



for all four sign combinations chosen consistently on both sides of the equa- 
tion. Hence, Pc{i,k) = P c ,io{hk). 

Case 5(i — \, k — \) > 0. Symmetrically, we have Pc{i, k) = Pc,hi{i, k). 

Case 8(1 — |, k — ^) < and 5{i + ^, k + \) > 0. By unit-monotonicity of 5, 
we have 

5{i-\,k+\) = 5{i+\,k-\) = Q 
Therefore, 

P§(i + lk + l) = P* h S+hk + l)<PE jl0 (i+l,k + l) (2.4) 
and, furthermore, 

P§(i ± I k ± i) = Pg to (* ± i fc ± |) (2.5) 

for the remaining three sign combinations chosen consistently on both sides 
of the equation. By Definitions 1, 2, we have 

p c (t,k) =pB(.-.) -pB(.-.) -PB{i+lk + \) +pE(...) 

P C ,lo& k) = Pc,lo(- ■ •) - P C,lo(- ■ •) - P E,lo(i +\,k+\) + PE,lo(- ■ •) 

By (2.4), we have a strict inequality between the two additive terms fully 
shown above, and by (2.5), the abbreviated additive terms are pairwise equal 
between the two expressions. Hence, Pc(i,k) > Pc,lo(hk). Since both Pc, 
Pc,io are zero-one matrices, this implies that Pc(i, k) = 1 and Pc,io{h k) = 
(symmetrically, also Pc,hi{hk) = 0). 

Summarising the above three cases, we have Pc(h k) = 1, if and only if 
one of the following mutually exclusive conditions holds: 

Pc,io(^ k) = 1 and 5(i+\,k + \) <Q (2.6) 
P C)W (», k) = 1 and 6(i - \, k - \) > (2.7) 
S(i - \, k - \) < and 8(i + \, k + \) > (2.8) 

In order to perform the checks (2.6)-(2.8) efficiently, it is sufficient to find 
for each d G [— n + 1 : n — 1] a value r(d) G [1 : In — 1], such that r(d) + d 
is odd, and 

5{i,k)<0 if i + k < r(k - i) 
5{i,k)>0 if i + k>r(k-i) 

for all i, k G [0 : n\ (note that the above list of two cases is exhaustive, 
since r(k — i) — (i + k) = r(k — i) + (k — i) — 2k must be odd, and therefore 
i+k ^ r(k — i)). Such a value r(d) is guaranteed to exist by the monotonicity 
of function S. Furthermore, values r(d) can be chosen so that \r(d + \) — 
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r(d — |)l = 1 for all d G (—n + 1 : n — 1). Informally, array r defines 
a monotone rectilinear path, consisting of points d , r ( ci ) +d ) ; from the 

bottom-left to the top-right corner of (0 : n | : n). 
By definition of array r, for each ii we have 

w - {d) = S ( r -W^=±, m^tl) e 0] 

w +( d ) = s( r -w^±±, m+d+i) e [0, 1] 

We call the values w~(d), w + (d) witnesses for r(d). 

Array r can be computed efficiently as follows. We loop from d = — n + 1 
to d = n — 1. For each d, we obtain the value r(d) along with its two 
witnesses. 

Initially, we have d = — n+1, r(— n+1) = n; the witnesses w~{— n+1) = 
<5(n — 1, 0) and u> + (— n + 1) = <5(n, 1) can be easily computed in time 0(1). 

Now assume that for a current value of d, we have the value r(d), and 
the witnesses w~(d), w + (d). Our next goal is to compute r(d + 1), along 
with its two witnesses. Let 



w 



S( rW-<^\ mg+±) G [_l : l] 



Value w* can be obtained from either w (d) or w + (d) by Theorem 2 in time 
0(1). We now let 



r(d+ 1) = r(d) + 



1 if w* G [—1 : 0] 
-1 if G [0 : 1] 



If =0, then the choice between 1 and —1 is made arbitrarily. Following 
this choice, we obtain the new witnesses as 

f^ra^.r^) if ^g [-1:0] 

I w* if w* G [0 : 1] 



w + (d+ 1) 



if io* €[-1:0] 
J( r(d) ~ d+1 , r(d) ^ +3 ) if W *€[0:l] 



In each case, the value for the new witness can be obtained from respectively 
w~(d), w + (d) by Theorem 2 in time 0(1). If w* = 0, then the choices are 
made consistently with the arbitrary choice made in the definition of r(d+l). 

The described loop runs until d = n — 1. At this point, we necessarily 
have r(n — 1) = n, w~(n — 1) = 6(0, n — 1) and w + (n — 1) = 8(1, n). The 
whole loop runs in time O(n). 

Given arrays r, w~ , conditions (2.6)-(2.8) can now be expressed as 
follows: 

P c ,io(i,k) = 1 and ? + < r(k-i) (2.9) 
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Pc,hi$M = 1 and i + k > r{k-i) (2.10) 
i + k = r(k — i) and w~ (k — i) = — 1 and w + (k — i) = 1 (2-11) 



The nonzeros of Pc satisfying either of the conditions (2.9), (2.10) can be 
found in time 0{n) by checking directly each of the nonzeros in matrices 
Pc,io an d Pc,hi- The nonzeros of Pq satisfying condition (2.11) can be found 
in time 0(n) by a linear sweep of the values r(d) for all d £ [— n + 1 : n — 1]. 
For each d, we let i = r ^ +d ; ^ = r ^ d ; and substitute these values into 
(2.11). We have now obtained all the nonzeros of matrix Pc- 

(End of recursive step) 

A generalisation to subpermutation matrices is as in Theorem 3. 

Time analysis. The recursion tree is a balanced binary tree of height log n. 
In the root node, the computation runs in time O(n). In each subsequent 
level, the number of nodes doubles, and the running time per node decreases 
by a factor of 2. Therefore, the overall running time is 0(n log n). ■ 

Example 7 Figure 2.3 illustrates the proof of Theorem 8 on a problem 
instance with a solution generated by the Mathematica 7 software. Sub- 
figure 2.3a shows a pair of input 20 x 20 permutation matrices Pa, Pb, 
with nonzeros indicated by green bullets. Subfigure 2.3b shows the parti- 
tioning of the implicit 20 x 20 matrix distance multiplication problem into 
two 10 x 10 subproblems. The nonzeros in the two subproblems are shown 
respectively by red crosses and blue diamonds. Subfigure 2.3c shows a recur- 
sive step. The boundaries separating set <5 _1 ([— 10 : —1]) from e) _1 ({0}), and 
set 5^ 1 ({0}) from <5 -1 ([l : 10]), are shown respectively by the red line and 
the blue line. Function r corresponds to an arbitrary monotone rectilinear 
path within (5 _1 ({0}), i.e. between the red and the blue lines, inclusive of 
the boundaries. In particular, either of the boundaries itself can be taken 
to define r. The nonzeros in the output matrix Pc satisfying (2.9), (2.10), 
(2.11) are shown respectively by red crosses, blue diamonds and green bul- 
lets; note that overall, there are 20 such nonzeros, and that they define a 
permutation matrix. □ 

2.4 Bruhat order 

Let Pa, Pb be permutation matrices over (0 : n \ : n). A classical algebraic 
measure of permutation comparison is given by the following partial order. 

Definition 12 Matrix Pa is lower than matrix Pb in the Bruhat order, if 
Pa can be transformed to Pb by a sequence of steps, each step substituting 
a submatrix of the form ( J 5 ) by a submatrix of the form ( 5 o ) • n 
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Pa Pb 

(a) Input matrices Pa, Pb 

Pb,Io, Pb,m 




Pa,Io, PA,hi Pc,lo + Pc,hi 

(b) Subproblems Pa,i □ Pb,Io = Pc.io and Pa,m □ Pb,m = Pc,m 

Pbjo, Pb , hi 




Pa,Io, Pa,u Pc 

(c) Conversion of Pc,i„ + Pcm into Pc 
Figure 2.3: Proof of Theorem 8: Pa Q Pb = Pc 
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Informally, P4 is lower than Pg, if P4 defines a "more sorted" permutation 
than Pb- More precisely, Pa is lower than Pb, if the permutation defined by 
Pb can be transformed into the one defined by Pa by a comparison network; 
this elegant interpretation of the definition was given by Knuth [83]. 

Bruhat order is a very important group-theoretic concept, which can be 
generalised to an arbitrary Coxeter group. We refer the reader to the mono- 
graph by Bjorner and Brenti [22] for more details and further references. 

Many equivalent definitions of the Bruhat order are known; see e.g. 
Bjorner and Brenti [22], Drake et al. [48], Johnson and Nasserasr [73]. Prob- 
ably the simplest one, referred to in [22] as the "dot criterion" , is as follows. 

Theorem 9 Matrix Pa is below matrix Pb in the Bruhat order, if and only 
if Pa ^ Pb elementwise. □ 

PROOF Straightforward from the definitions; see [22]. ■ 

Theorem 9 immediately gives an algorithm for deciding whether two per- 
mutation matrices are Bruhat-comparable in time 0(n 2 ). To the author's 
knowledge, no asymptotically faster algorithms for deciding Bruhat compa- 
rability have been known. 

As a first application of our results from Section 2.1, we now give a fast 
algorithm for Bruhat comparability, improving on the algorithm based on 
Theorem 9. 

Theorem 10 Matrix Pa is below matrix Pb in the Bruhat order, if and 
only if P% □ P B = Id R . □ 

Proof Let Pa be below P B . Let i,j G [0 : n). We have 

(P4 ) S (i, j) + Pf (j, n-i)= (definition of R) 

(n — i — P4 (j, n — z)) + Pg (j, n — i) = (term rearrangement) 

(P|(j, n - i) - (j, n - i)) + n - i > (Theorem 9) 

n — i 

For j = (and, symmetrically, j = n), this lower bound is attained: we 
have (Pf) s (i, 0) + P|(0, n - i) = + (n - i) = n - i. Therefore, 

(PfQPij) s (i,n-i) = (definition of □) 

min((Pf) s (i, j) + P|(j, n - ij) = (from above) 

j 

n — i 

By (2), this implies that Pf □ P B = Id R . 

Conversely, P% □ P B = Id R implies that for all i, mm,-((Pf ) s (i, j) + 
Pg(j,n — i)) = n — i. From the above, this is equivalent to Pg(j,n — 
i) — P]((j,n — i) > 0, therefore P]((j,n — i) < Pg(j,n — i) for all By 
Theorem 9, this implies that P4 is below Pb in the Bruhat order. ■ 
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In combination with Theorem 8, Theorem 10 gives an algorithm for 
deciding Bruhat comparability in time 0(n log n). 
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Chapter 3 

Semi-local string comparison 



3.1 Semi-local LCS 

We will consider strings of characters taken from an alphabet. No a priori 
assumptions are made on the size of the alphabet and on the model of com- 
putation; we will make specific assumptions in different contexts (e.g. a fixed 
finite alphabet with only equality comparisons, or an alphabet of integers 
up to a given n with standard arithmetic operations, etc.) Two alphabet 
characters a, j3 match, if a = /3, and mismatch otherwise. In addition to 
alphabet characters, we introduce the special non-alphabet character '$', 
which only matches itself and does not match any alphabet characters, and 
the wildcard character 'o', which matches both itself and all the alphabet 
characters. 

It will be convenient to index strings by odd half-integer, rather than 
integer indices, e.g. string a = aia± . ..a„_i. We will index strings as 

2 2 2 

vectors, writing e.g. a(i) = a%, a(i : j) = ct i+ i . . . ct-_ \ . String concatenation 
will be denoted by juxtaposition. 

Given a string, we distinguish between its contiguous substrings, and 
not necessarily contiguous subsequences. Special cases of a substring are 
a prefix and a suffix of a string. Given a string a of length m, we use the 
take/drop notation of [130] for prefixes and suffixes of a: 

a] k = a(0 : k) a\k = a(k : m) 

a\k = a{m — k : m) a [ k = a(0 : m — k) 

Unless indicated otherwise, our algorithms will take as input a string a of 
length m, and a string b of length n. 

A classical approach to string comparison is based on the following nu- 
merical measure of string similarity. 

Definition 13 Given strings a, 6, the longest common subsequence (LCS) 
problem asks for the length of the longest string that is a subsequence of 
both a and b. We will call this length the LCS score of strings a, b. □ 



31 



Example 8 The LCS score of strings a = "baabcbca" , b = "baabcabcabaca" 
is 8; string b contains the whole string CI clS cl subsequence. The LCS score 
of string a against substring 6(4 : 11) = "cabcaba" is 5; this score is realised 
by a common subsequence "abcba". This example, which will serve as a 
running example for this chapter, is borrowed from Alves et al. [7]. □ 

The best known algorithms for the LCS problem run within (model- 
dependent) polylogarithmic factors of 0(mn). We will recall the necessary 
background on LCS algorithms in Sections 4.1, 4.2, 8.3. 

A simple special case of the LCS problem is the (global) subsequence 
recognition problem (also known as the "subsequence matching problem"). 
Given a text string t of length n and a pattern string p of length m < n, 
the problem asks whether t contains the whole p as a subsequence. This is 
equivalent to asking whether the LCS score of t against p is exactly m. The 
global subsequence recognition problem has been considered e.g. by Aho et 
al. [3, Section 9.3], who describe a straightforward algorithm running in time 
0(n). Various extensions of this problem have been explored by Crochemore 
et al. [41]. 

In this work, we generalise the LCS problem to provide a more detailed 
measure of string similarity. 

Definition 14 Given strings a, b, the semi-local LCS problem asks for the 
LCS scores as follows: 

• a against every substring of b (the string-substring LCS scores); 

• every prefix of a against every suffix of b (the prefix-suffix LCS scores); 

• symmetrically, the substring- string LCS scores and the suffix-prefix 
LCS scores, defined as above but with the roles of a and b exchanged. □ 

A traditional distinction, especially in computational biology, is between 
global (full string against full string) and local (all substrings against all 
substrings) comparison. Semi-local comparison lies between these two ex- 
tremes, which explains the terminology A common alternative term used 
in biological texts is "end-free comparison", see e.g. [63, Subsection 11.6.4]. 
It turns out that this is a very natural and useful type of string comparison. 

Many string comparison algorithms output either a single optimal com- 
parison score across all local comparisons, or a number of local compari- 
son scores that are "sufficiently close" to the globally optimal. In contrast 
with this approach, Definition 14 asks for all the locally optimal compar- 
ison scores. This approach is more flexible, and will be useful for various 
applications described later in this work. 

It turns out that, although more general than the LCS problem, the 
semi-local LCS problem can also be solved within (model-dependent) poly- 
logarithmic factors of 0{mn). We will consider semi-local LCS algorithms 
on plain strings in Sections 4.1, 4.2, and on compressed strings in Section 8.3. 
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A special case of the semi-local LCS problem is the local subsequence 
recognition problem, which, given a text t and a pattern p, asks for the 
substrings in t containing p as a subsequence. This problem can also be 
regarded as a basic form of approximate pattern matching. We will consider 
algorithms for local subsequence recognition and other types of approximate 
pattern matching on plain strings in Section 5.2, and on compressed strings 
in Sections 8.4-8.5. 

In certain contexts, such as when one of the input strings is very long, 
we may not wish to deal with all the substrings of the longer string, but still 
to consider the other three components of the semi-local LCS problem. 

Definition 15 Given strings a, b, the three-way semi-local LCS problem 
asks for the string-substring, prefix-suffix and suffix-prefix LCS scores as in 
Definition 14, but excludes the substring-string LCS scores. The same term 
three-way semi-local LCS problem will also refer to the symmetric version of 
the problem, that asks for the prefix-suffix, suffix-prefix and substring-string 
LCS scores, but excludes the string-substring LCS scores. □ 

We will occasionally use the term full semi-local LCS to distinguish the stan- 
dard four-way semi-local LCS problem from its restricted three-way version. 

3.2 Alignment dags and score matrices 

It is well-known that an instance of the LCS problem can be represented by a 
dag (directed acyclic graph) on a rectangular grid of nodes, where character 
matches correspond to edges scoring 1, and mismatches to edges scoring 0. 

Definition 16 An alignment dag is a weighted dag, defined on the set of 
nodes v^i, I G [Iq : h], i G [«0j*i]- The edge and path weights are called 
scores. For all I G [Iq : h], I G (Iq : ii), % G [io,«i], i G (iq : i\), the alignment 
dag contains: 

• the horizontal edge v, -i — > v, } , i and the vertical edge V; i . — >■ 

M 2 , ' ,t 2 '~2' J 

Vf i ., both with score 0; 

• the diagonal edge vti ~i — > vj i ? i with score either or 1. □ 

2' 2 1 '2' 2 

An alignment dag can be viewed as an (Zi — Zo) x (h — *o) grid of cells. An 
instance of the semi-local LCS problem on strings a, b corresponds to an 
m x n alignment dag G a ^ a cell indexed by I G (0 : m), i G (0 : n) is called 
a match cell, if a(l) matches b(i), and a mismatch cell otherwise (recall that 
the strings may contain wildcard characters). The diagonal edges in match 
cells have score 1, and in mismatch cells score 0. Clearly, the diagonal edges 
with score do not affect maximum node-to-node scores, and can therefore 
be ignored. 
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Figure 3.1: Alignment dag G a fi and a highest-scoring path 

Particular examples of an alignment dag are the full-mismatch dag and 
the full-match dag, which consist entirely of mismatch or match cells, re- 
spectively. The dag G a ^ is the full-mismatch dag, when the strings a, b 
have no characters in common. The dag G a ^ is the full-match dag, when 
both strings consist of a single repeated character, or when one of the strings 
consists entirely of wildcard characters. 

Given an instance of the LCS problem on strings a, 6, common string- 
substring, suffix-prefix, prefix-suffix, and substring-string subsequences cor- 
respond, respectively, to paths of the following form in the alignment dag 
G a ,b'- 



VQ,i ~> V m y Vlfi V m> i> VQ,i VVp, Vlfl Vp 



(3.1) 



where 1,1' € [0 : m], i, i' E [0 : n]. The length of each subsequence is equal 
to the total score of its corresponding path. The semi-local LCS problem is 
equivalent to the problem of finding the highest-scoring paths for each of the 
four path types (3.1) and each possible pair of endpoints on the boundary 
of the alignment dag. 

Example 9 Figure 3.1 shows the alignment dag for strings a = "baabcbca" , 
b = "baabcabcabaca" . The highlighted path of score 5 corresponds to 
the string-substring LCS score for string a against substring b{4 : 11) = 
"cabcaba". □ 



Finding the highest-scoring paths is equivalent to finding the correspond- 
ing shortest distances in an undirected graph, obtained from the alignment 
dag by assigning length 1 to vertical and horizontal edges, assigning lengths 
and 2 to diagonal edges in match and mismatch cells respectively, and 
ignoring edge directions. Thus, the problem is equivalent to the problem of 
finding distances between boundary nodes and all nodes on a special case 
of a weighted undirected planar graph. This problem has been previously 
studied by Schmidt [117] on real-weighted alignment dags, and by Klein [81] 
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and Cabello and Chambers [27] on general real-weighted undirected pla- 
nar graphs. In contrast with these approaches, we exploit both the special 
structure of the alignment dag, and the discreteness of the weights. 

The analysis of the four different path types (3.1) can be simplified by 
padding one of the input strings with wildcard characters. Accordingly, we 
need to consider an extended alignment dag for string a over (0 : m) against 
string o m bo m over (— m : m + n). 

Definition 17 Given strings a, b, the corresponding semi-local score ma- 
trix 1 is a matrix over [— m : n | : m + n], defined by 

H a ,b(i,j) = max score (v ,i ~* v m ,j) 

where i G [— m : n], j G [0 : m + n], and the maximum is taken across 
all paths between the given endpoints in the m x (2m + n) alignment dag 
G a ,o m bo m - If i = j, we have H a ^(i,j) = 0. By convention, if j < i, then we 
let H aib (i,j) = j-i < 0. □ 



Example 10 Figure 3.2 shows the matrix Ha^, giving all semi-local LCS 
scores for strings a, b as in the previous examples. The entry H a ^(4, 11) = 5 
is circled. □ 

The solution for each of the four components (3.1) of the semi-local 
LCS problem can now be obtained from the semi-local score matrix H a ^ as 
follows: 

max score(v ,j ~> v md >) = H ajb (j,j') 

max score(v ifl ~> v mJ ,) = H ajb (-i,f) - i 

max score(voj ~~> ««',„) = H a ^(j, m + n — i') — m + i! 

max score(vifi Vi^ n ) = H a ^(—i, m + n — i') — m — i + i! 

where i, i' G [0 : m], j,j' G [0 : n], and the maxima are taken across all paths 
between the given endpoints. 

Special properties of semi-local score matrices have been extensively used 
in algorithm design. These properties are captured by the following theorem. 

Theorem 11 Given strings a, b, the corresponding semi-local score matrix 
H a ^ is unit-anti-Monge. In particular, we have 

H a ,b(i,j) = j - i- Pa, b (i,j) 

1 These matrices are called "DIST matrices" e.g. in [117, 39], and "score matrices" in 
[122] . Our current terminology is chosen to reflect the semi-local score-maximising nature 
of the matrix elements, while avoiding confusion with pairwise substitution score matrices 
used in comparative genomics (see e.g. [74]). 
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Figure 3.2: Matrices H a)b and P a ^ 



where P a)b is a permutation matrix over (— m : n | : m + n) . We also have 

H a>b (i,j)=m-(Pl b f(j,i) 

(note the matrix transposition and the exchange of indices). In particular, 
string a is a subsequence of substring b(i : j) for some i,j G [0 : n], if and 
onlyif(Pl b f{j,i)=Q. 

Proof Let i £ (—m : n), j £ (0 : m + n). Any crossing pair of paths 
d o;+ i v m ~_i and v oi _i ~~> v m ~ + i can be rearranged into a non-crossing 
pair of paths v ot _i v m ~_i and v Q i+ i -w v m ~ + i of the same total score. 

Therefore, we have H^ b (i,j) < 0, hence matrix H a ^ is anti-Monge. 

Let Q a ,b(i-,j) = j — i — H a ^(i,j). From the above, matrix Q a b is Monge. 
Furthermore, we have 

Q a ,b(i -\,m + n) - Q a ,b(i +\,m + n) = 

1 - H a)b (i- 2,m + n) + H a ^{i+ \,m + n) = 1 — m + m = l 

and 

Qa,b{-m,j+ \) - Q a ,b(-m,j- \) = 

1 - H a:b (-m,j+ \) + H a>b (-m,j- |) = l- m + m= l 
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Together, the above properties imply that Q a ^ is a simple unit-Monge ma- 
trix, Therefore, Q a ,b = Pjfhi where P a ^ is a permutation matrix. 

The second part of the theorem statement is now straightforward from 
the definitions. ■ 

Intuitively, Theorem 11 can be interpreted as follows. Let a' be a substring 
of a, and consider its LCS score against string b. If substring a' is extended 
at either end by one character, the LCS score may stay constant, if the new 
character cannot be usefully matched to a character of b; the LCS score 
may also increase by 1, if a useful match can be found. Starting from an 
empty substring a' , any new match is clearly useful. The unit-anti-Monge 
property of matrix H a ^ describes the fact that, as substring a 1 grows in size, 
not all the new matches can be used simultaneously: some useful matches 
will become useless, but a useless match will always remain to be so. The 
"critical points" at which useful matches become useless correspond to the 
nonzeros of the permutation matrix P a ^. However, it should be noted that 
this is only general intuition: in fact, the notion of a useful match is not 
absolute, but depends on the choice of a particular highest-scoring path 
through the alignment dag. 

The proof of Theorem 11 holds more generally for any alignment dag 
with an arbitrary mix of match and mismatch cells, not necessarily defined 
by the matching between any particular pair of strings. However, the given 
form of the theorem will be sufficient for the rest of this work. 

The key idea of our approach is to view Theorem 1 1 as a description of 
an implicit solution to the semi-local LCS problem. The semi-local score 
matrix H a ^ is represented implicitly by the nonzeros of the permutation 
matrix P a ,b- 

Definition 18 Given strings a, 6, the semi-local seaweed matrix is a per- 
mutation matrix P a ^ over (— m : n | : m + n), defined by Theorem 11. □ 

Definition 18 leads to the following interpretation of Theorem 11: the LCS 
score for string a against substring b(i : j) is determined by the number of 
nonzeros in the semi-local seaweed matrix P a ,b, that are ^-dominated by 
(respectively, that ^-dominate) the point (i, j). 

Example 11 Figure 3.2 shows the unit-anti-Monge property of matrix H a ^ 
by a coloured grid pattern, where the red (respectively, blue) lines separate 
matrix elements that differ by 1 (respectively, by 0). The nonzeros of the 
semi-local seaweed matrix P a ^ over (—8 : 13 | : 8 + 13) are shown by green 
bullets. 

The nonzeros of P a ^ that are ^-dominated by the point (4, 11) corre- 
spond to the green bullets lying below and to the left of the circled en- 
try. Note that there are exactly two such nonzeros, and that H a ^(A, 11) = 
11 — 4 — 2 = 5. The nonzeros of P a j, that ^-dominate the point (4, 11) 
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baabcabcabaca 




Figure 3.3: Alignment dag G a ^ and nonzeros of P a £ as seaweeds 

correspond to the green bullets lying above and to the right of the cir- 
cled entry. Note that there are exactly three such nonzeros, and that 
#0,6(4,11) = 8-3 = 5. □ 

A semi-local seaweed matrix can be naturally identified with a seaweed 
braid on m + n seaweeds. As we show in the following section, divide-and- 
conquer solutions to the semi-local LCS problem generally correspond to 
implicit distance multiplication of seaweed matrices, and therefore also to 
multiplication of the corresponding seaweed braids. 

Example 12 Figure 3.3 shows matrix P a ^ as a seaweed braid, laid out 
directly on the alignment dag G a ^. The nonzeros correspond to seaweeds, 
laid out as paths in the dual graph. In particular, every nonzero P a ,b(hj) = 
1, where i, j G (0 : 13), is represented by a seaweed originating between 
the nodes v Q j_ i and v oi+ i, and terminating between the nodes v 8 i and 
u 8j+i- ^he remaining seaweeds, originating or terminating at the sides of 
the dag, correspond to nonzeros P a ,b(hj) = 1> where either i G (—8 : 0) 
or j G (13 : 8 + 13) (or both). In particular, every nonzero P a ,b(hj) = 1> 
where % G (—8 : 0) (respectively, j G (13 : 8 + 13)) is represented by a seaweed 
originating between the nodes i an d o (respectively, terminating 

between the nodes v 8+13 _~_i 13 and v 8+13 _~ + i 13 ). For the purposes of this 
example, the specific layout of the seaweeds between their endpoints is not 
important. However, this layout will become meaningful in the context of 
the algorithms described in the next chapter. 

The full set of 8 + 13 = 21 nonzeros in Figure 3.2 corresponds to the full 
set of 21 seaweeds in Figure 3.3. The two nonzeros that are ^-dominated by 
the point (4, 11) correspond to the two seaweeds fitting completely between 
the two dashed vertical lines i = 4 and j = 11. The three nonzeros that 
^-dominate the point (4, 11) correspond to the three seaweeds piercing both 
these lines. □ 
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With minimal modification, the definitions of score and seaweed matrices 
can also be applied separately to each component of the semi-local LCS 
problem. For that, we split up the definitions as follows. 

Definition 19 Given strings a, b, the corresponding string-substring, prefix- 
suffix, suffix-prefix and sub string -string score matrices are respectively the 
following submatrices of the semi-local score matrix H ab over [— m : n | : 
m + n]: 

H a,b = H a,b[° : n I : n] H* b = tf a , b [0 :n\n:m + n] 

Kb = H a,b[~ m : | : n] fl£ 6 = H afi [-m : | n : m + n] 

Definition 20 Given strings a, 6, the corresponding string-substring, prefix- 
suffix, suffix-prefix and sub string -string seaweed matrices are respectively the 
following subpermutation submatrices of the semi-local seaweed matrix P a ^: 

P^,b = Pa,b(0 :n\0:n) ^ b = P a , 6 (0 : n \ n : m + n) 

Kb = Pa,b(~ m : | : n) P^ 6 = P ,6<-m : | n : m + n) □ 

Note that the ranges of the four score submatrices in Definition 19 are not 
disjoint: they overlap along the boundaries corresponding to i = and 
j = n. In particular, the entry H a j,{Q,n), which gives the LCS score of the 
whole strings a and b, belongs to all four submatrices. The ranges of the 
four seaweed submatrices in Definition 20 are disjoint. 

Example 13 Figure 3.2 shows the partition of H a ,b an d P a ,b in Defini- 
tions 19, 20 by thin dotted lines. The string-substring, prefix-suffix, suffix- 
prefix and substring-string submatrices are respectively on the bottom-left, 
bottom-right, top- left and top- right. The elements of H a ^ lying directly on 
the dotted lines are shared by the bordering submatrices. Note that the 
substring-string submatrices H^ b and P^ b are both trivial; this is due to 
the fact that the whole string a is a subsequence of b. □ 

The nonzeros of each matrix introduced in Definition 20 can be regarded 
as an implicit solution to the corresponding component of the semi-local LCS 
problem. Similarly, the first three and the last three matrices can serve as 
an implicit solution to the respective versions of the three-way semi-local 
LCS problem. The combined number of nonzeros in the first three matrices 
is at least n and at most min(m + n, 2n); note that for m> n, this number 
is independent of m. Analogously, for m < n, the combined number of 
nonzeros in the last three matrices is independent of n. 

Both the full semi-local seaweed matrices, and the component seaweed 
matrices introduced in Definition 20, can be processed into an efficient data 
structure of Theorem 1 for answering individual semi-local score queries. 
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The definitions of score and seaweed matrices are not symmetric with 
respect to the order of the input strings. The precise relationship between 
score matrices H ab , H ba , as well as between their seaweed counterparts, is 
given by the following lemma. 

Lemma 3 Given input strings a, b, we have 

H b , a (i,j) = H a>b (-i,m + n- j) - i + j-n 



for all i G [0 : n] , j G [0 : m] , and 

Pb,a(i'j) = p a,b(-hm + n- j) 
for all i £ (0 : n), j G (0 : m). 
Proof Straightforward by definitions. 



□ 



3.3 Score matrix composition 

We now describe how the previously introduced techniques can be applied 
within a divide-and-conquer framework. 

Let a' , a" , b' , b" be nonempty strings of length m' , m" , n', n" respec- 
tively. We will consider the comparison of a concatenation string a = a' a" 
of length m = m' + m" against a fixed string b of length n; symmetrically, 
the definitions and results will also apply to the comparison of a fixed string 
a of length m against a concatenation string b = b'b" of length n = n' + n" . 

Given a concatenation string a = a'a" , a substring a(i' : i") with i' G [0 : 
m' — 1], i" G [m? + 1 : m] will be called a cross-substring. In other words, 
a cross-substring of a is a concatenation of a nonempty suffix of a' and a 
nonempty prefix of a". A cross-substring that is a prefix or a suffix of a will 
be called a cross-prefix and a cross-suffix, respectively. 

Definition 21 Given strings a = a'a" and 6, the corresponding seaweed 
cross-matrix is the subpermutation matrix 

P$, a »,b = p afi{-™' ■n\0:m" + n) 

Symmetrically, given strings a and b = b'b", the corresponding seaweed cross- 
matrix is the subpermutation matrix 

P^fi'fi" = p a,b{~m :n'\n':m + n) □ 

In contrast with the seaweed matrices in Definition 20, the dimensions of the 
seaweed cross-matrices introduced by Definition 21 depend on the lengths 
of individual component strings in the concatenation (a', a" or b' , b" , re- 
spectively). Hence, the notation for these matrices involves three, rather 
than two, string subscripts. There is a slight abuse of terminology in that 
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it does not distinguish directly the two symmetric versions of the definition; 
however, the versions are distinguished by the notation, and will always be 
clear from the context. Note that P?, is a submatrix of Pf, , , and P?, is 

a,b a' ,a" ,6' a,b 

a submatrix of Pf b , b „ . 

The alignment dag G a ,b consists of alignment subdags G a >^, G a » jb , shar- 
ing a horizontal row of n nodes and n — 1 edges, which is simultaneously the 
bottom row of G a ',b and the top row of G a ",b- We will say that dag G a>b is 
the composite of dags G a ^ b and G a " jb . 

Our goal is, given the semi-local score matrices H a ij>, H a » jb , to compute 
matrix H ab efficiently. We call this procedure score matrix composition. 

By Theorem 11, score matrices H a / jb , Ha",bi H<i,b can be represented 
implicitly by seaweed matrices P a ',b, Pa",b, Pa,b- We argue that score ma- 
trix composition corresponds generally to implicit distance multiplication of 
these seaweed matrices. 

Theorem 12 We have 

P afi = {I m , U P aljb ) □ (P a „ >6 U I m n) (3.2) 

where 

I m i = Id m i(—m : —m! \ —m" : 0) 

I m " = Id m " (n : m! + n\ m" + n : m + n) 

We also have 

P?,b = Pa>,b®^>,b (3-3) 

^aV, 6 = {^,b U P%>,b) H U F°„ t b ) (3.4) 

□ 

Proof By Theorem 11, we have 

H a>b (i, k) = k-i- P% h (i, k) 

for all i G [— m : n], k G [0 : m + n]. Three cases are possible, based on the 
partitioning of the index ranges. 

Case i G [— m! : n\, k G [0 : m + n"]. By Definition 17 and Theorem 11, we 
have 

H a ,b(i, k) = max (H a , h (i, j) + H a n h {j, k)) = 

je[0:n] 

max (j - i - P^ b (i,j) + k-j- P$, b (j, k)) = 

]£[0:n] 

k-i - mm (P$ >b (i,j) + P$, tb U, k)) 
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Therefore, 



= mm 



je[0:n] 



{P$ tb (i,j) + P$, th (j, k)) = (Pj b i$ >6 )(i, k) 



In particular, this holds for i, k € [0 : n]. Hence, we have 



a, 6 



pH m pC 



Case i G [— m : —m'\, k G [0 : m + n]. We have 



#a,&(i, fe) = m' + H a 'i jb (i + m', fe) = 

m' + fe - (i + m') - + m', fc) = 

k — i — Ptf, b (i + m' , k) 



Therefore 



P a s b (i, fc) = i^, j6 (i + m', k) = (I m , □ P a »,6) S (i, k) 



Case i G [— m : n], k £ \m" + n : m + n\. Symmetrically to the previous case, 
we have 



Summarising the above three cases, we have the proof of (3.2). From the 
first case, we also have the proof of (3.3). The proof of (3.4) is similar. ■ 

Example 14 Figure 3.4 shows an instance of semi-local score matrix com- 
position represented by seaweed matrices. The nonzeros in the matrices are 
given by seaweed braids (using an arbitrary layout of individual seaweeds). 
Subfigure 3.4a shows the input matrices P a ',b, Pa",b- Additionally, it shows 
the auxiliary matrices I m /, I m » by dotted seaweeds. Subfigure 3.4c shows 
the output matrix P a>b . □ 

Let n* = mm(m' ,m" ,n) and m* = min(m, n', n"). 

Theorem 13 The implicit distance product (3.2) can be computed in time 
0(m + nlogn*) . Each of the implicit distance products (3.3), (3.4) can be 
computed in time O (nlogn*). □ 

PROOF First, we consider the implicit distance product (3.2). If n* = n, 
then we compute the product directly by Theorem 8. 

We may now assume without loss of generality that n* = m", and that 
^77 > 1 is an integer. It will be convenient to express the proof in terms 
of seaweed braids. Consider a layout of the seaweed braid corresponding to 
Pa",b within the m" x n alignment dag G a ",b- Let us partition this dag into 
-777 square blocks of size m". Clearly, every boundary between two successive 



P* b (i,k) = (P a/jb BI mf/ f(i,k) 
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blocks is crossed by at most m" seaweeds. This property is independent 
of any particular seaweed layout. Given the seaweed matrix P a » it is 
straightforward to lay out the seaweed braids between the block boundaries 
within each block in time 0(m") per block, and therefore in overall time 
^77 • 0(m") = 0(n). Note that this is done with only the matrix P a » 5 6 as 
input, and therefore the resulting layout may not correspond to the original 
alignment dag. The seaweed braid of size m" + n, corresponding to P a ",b, is 
now decomposed into ^77 "staggered" seaweed subbraids, each of size 2m" . 
Recall that by Theorem 12, we have 

P atb = (I m , U P a , tb ) □ (P a « )6 U I m „) 

We compute this product by first performing ^77 successive seaweed braid 
multiplications of size m" , one in each block of G a ii b . Every one of these 
multiplications can be performed by Theorem 8 in time 0(m" log m"). Af- 
ter that, we perform a multiplication by the identity subbraid of size m', 
corresponding to the offset identity matrix I m ". This can be done in time 
0{m') = 0(m). The overall running time is 0(m + ^77 ■ m"logm") = 
0(m + nlog m"). 

The proof for (3.3), (3.4) is analogous. ■ 

Example 15 Figure 3.4 illustrates the proof of Theorem 13 as follows. De- 
composition of the seaweed braid corresponding to matrix P a ",b into blocks 
is shown in Subfigure 3.4a by thin dotted lines. Subfigure 3.4b shows the 
blocks of P a ",b in "staggered" form. Clearly, the whole seaweed braid for 
P 0) 6 can be obtained from P a >^ by successive composition with staggered 
block subbraids of P a ",b, block-by-block. Subfigure 3.4c shows the resulting 
seaweed braid for the composition matrix P a &. □ 

Theorems 12, 13 give a divide-and-conquer solution to the string-substring 
and substring-string LCS problems, which we formulate as follows. 

Corollary 1 Given the nonzeros of matrices P" b , P^„ b , it is possible to 
compute the nonzeros of matrix P° b in time 0(nlogn*). Symmetrically, 
given the nonzeros of matrices P^ b ,, P^yi, it * s possible to compute the 
nonzeros of matrix P^ b in time 0(mlogm*). □ 

Proof The first claim is directly by Theorems 12, 13. The symmetric claim 
follows by Lemma 3. ■ 

Similarly, we obtain a divide-and-conquer solution to both symmetric 
versions of the three-way semi-local LCS problem. 

Corollary 2 Given the nonzeros of matrices 
pa pa pn pE pa pn 

r a',b r a',b r a',b r a" ,b r a" ,b r a",b 
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it is possible to compute the nonzeros of matrices 

pC pQ pD pB 

a,b a,b a,b a',a",b 

in time 0(nlogn*). 

Symmetrically, given the nonzeros of matrices 

pH pD pH pQ pH pH 

-^,6' ^a,b' -^a.ft' r a,b" r af>" r a,b" 

it is possible to compute the nonzeros of matrices 

pQ pC p3 pi 

r a,b r a,b r a,b r a,V ,b" 

in time 0(mlogm*). □ 

Proof To prove the first claim, we compute P^ b and -P°, a „ b in time 0(n log n*) 
by Theorems 12, 13. We now have 

1% = H / m ») U i* >a „ b (0 :n|n:m" + n) 

^ = ^,a», 6 (-^ : | : n) U (j m , □ P*, b ) 

Both these expressions can be computed trivially in time 0(n). The overall 
running time is 0(nlogn*). 

The symmetric claim follows by Lemma 3. ■ 

While the cross-matrices in the statement of Corollary 2 do not contribute 
to the solution of the three-way semi-local LCS problem for a against b, 
they are included for the sake of other applications, described in subsequent 
chapters. 

Finally, we obtain a divide-and-conquer solution to the full semi-local 
LCS problem. 

Corollary 3 Given the nonzeros of matrices P a > b> Pa" bj it ^ s possible to 
compute the nonzeros of matrix P a ^ in time 0(m+n log n*) . Symmetrically, 
given the nonzeros of matrices P a ,b' , Pa,b", it is possible to compute the 
nonzeros of matrix P a ^ in time O (to log to* + n) . □ 

Proof The first claim is directly by Theorems 12, 13. The symmetric claim 
follows by Lemma 3. ■ 
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Chapter 4 

The seaweed method 



4.1 The seaweed algorithm 

A classical solution to the global LCS problem is given by the dynamic pro- 
gramming algorithm, discovered independently by Needleman and Wunsch 
(without an explicit analysis) [104], and by Wagner and Fischer [129]. This 
algorithm assumes a character comparison model that only allows compari- 
son outcomes "equal" and "unequal" , and the unit-cost RAM computation 
model. The algorithm runs in time 0(mn). As a byproduct, it solves the 
LCS problem for all prefixes of input string a against all prefixes of input 
string b. 

A naive algorithm for the semi- local LCS problem runs in time 0[ (m + 
n) 4 ). Based on the ideas of Schmidt [117], Alves et al. [7] gave an algorithm 
for the string-substring LCS problem that runs in time 0(mn), which there- 
fore extends the functionality of the standard dynamic programming LCS 
algorithm, while matching its asymptotic running time. As a byproduct, 
the algorithm solves the string-substring LCS problem for all prefixes of a 
against the whole b. 

We now give a simple algorithm for the semi-local LCS problem, which 
further improves on the functionality of the above algorithms, while still 
matching their model assumptions and asymptotic running time. We call 
it the seaweed algorithm, since it has a simple interpretation in terms of 
seaweed braids introduced in Chapter 1. 

Algorithm 1 (Semi-local LCS: The seaweed algorithm) 
Input: strings a, b of length m, n, respectively. 
Output: nonzeros of semi- local seaweed matrix P a ^. 

Description. It will be convenient to express the algorithm in terms of 
seaweed braids. Informally, the algorithm works as follows. We construct a 
seaweed braid on m + n seaweeds, laid out on the alignment dag G a ^. Every 
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seaweed is traced across the alignment dag in the top-to-bottom or left-to- 
right direction. A seaweed runs in a straight horizontal or vertical line by 
default; however, its direction may be affected either by passing through a 
match cell, or by meeting another seaweed. Every cell has two seaweeds 
passing through it, entering the cell at the top and the left, and leaving it 
at the bottom and the right, not necessarily in that order. 

The layout of the two seaweeds within a cell is decided as follows. In 
a match cell, both seaweeds "bend away" from each other, so the seaweed 
entering at the top exits on the right, and the seaweed entering on the left 
exits at the bottom. In a mismatch cell, the two seaweeds keep straight 
and cross each other, if and only if this pair of seaweeds have not previously 
crossed; otherwise, they bend as in a match cell. Therefore, any given pair 
of seaweeds are only allowed to cross at most once in the course of the 
computation. Notice that the same property of avoiding double-crossings 
also holds for any pair of highest-scoring paths in the alignment dag. 

We now formalise the algorithm as follows. We start with a full- match 
alignment dag Go>»,o", and transform it by incremental cell updates into 
the alignment dag G a ^. We will maintain a variable seaweed braid on m + n 
seaweeds, corresponding to the current alignment dag. We index the starting 
positions of the seaweeds by (— m : n), and the terminating positions by 
(0 : m + n). The current seaweed braid will thus be represented by a variable 
permutation matrix P over (— m : n | : m + n). A nonzero P a ,b(h j) = 1> 
i G (— m : n), j G (0 : m + n), represents a seaweed starting at position l 
and terminating at position j. 

The initial full-match alignment dag Go m ,o n corresponds to an identity 
seaweed braid. Therefore, we initialise 

P <— Id m (—m : n | : m + n) 

The algorithm then sweeps the alignment dag in an arbitrary order com- 
patible with the <C-dominance order of the cells. Recall that every cell in the 
dag is initially a match cell. If the current cell in G at b is also a match cell, no 
update is needed. Otherwise, the current match cell has to be transformed 
into a mismatch cell, resulting in an update on the current seaweed braid, 
which has to be reflected in its representing matrix P. 

Consider a cell defined by the characters a(l), b(i), I G (0 : m), i G (0 : n). 
Let i* = i + m — I. The two seaweeds passing through the current cell 
terminate at positions { i* — ^, i* + ^} = (i* — 1 : i* + 1). We update a 2 x 2 
induced permutation submatrix of P as follows: 

if a{l) + b(i) and P(- \ i* - 1 : i* + 1) = ( J ?) 
then P{- | i* - 1 : i* + 1) <- (JJ) 

By Theorem 11, the algorithm maintains the invariant "current state of ma- 
trix P is the semi-local seaweed matrix for the current state of the alignment 
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Figure 4.1: A snapshot of Algorithm 1 (the seaweed algorithm) 
dag". Therefore, at the end of the sweep, we have the output matrix 

P = Pa,b 

Cost analysis. For every cell, the 2x2 column-induced submatrix P(- \ 
i* — 1 : i* + 1) can be obtained from matrix P in time 0{\). The cell update 
also runs in time 0(1). Therefore, the overall running time is 0{mn). 

The memory cost is dominated by storing the input and the linear rep- 
resentation of the current matrix P. Therefore, the overall memory cost is 
0{m + n). ■ 



Example 16 Figure 4.1 shows a snapshot of Algorithm 1. The dag G a ^ is 
swept in the top-to-bottom, left-to-right lexicographic cell order. The dag 
area that has already been swept is shown by the dark border; the current 
cell is shaded in yellow. Since the two seaweeds meeting in the current 
cell have previously crossed, the current step will leave the seaweed braid 
unchanged, so that double crossing does not occur. The final layout of the 
seaweeds is the one shown in Figure 3.3; it describes the full sequence of 
states of the seaweed braid represented by matrix P in Algorithm 1. □ 

Recall that the dag cells in Algorithm 1 can be processed in any order 
consistent with the <C-dominance partial order. In particular, the cell pro- 
cessing order can be fixed so that the algorithm will compute incrementally 
the semi-local seaweed matrix for all prefixes of string a against whole string 
b. By keeping the algorithm's intermediate data, we obtain a data structure 
that allows efficient LCS queries for every prefix of a against every substring 
of b. As in the classical dynamic programming approach, this data structure 
can be used to trace back (i.e. to obtain character by character) the actual 
LCS corresponding to a prefix-substring LCS query, in time proportional to 
the size of the output (i.e. the length of the output subsequence). Alter- 
natively, a technique similar to memory-efficient dynamic programming by 
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Hirschberg [66] can be applied to achieve prefix-substring LCS traceback in 
the same asymptotic time, but in a linear amount of memory. 

Assuming the "equal/unequal" character comparison model, Aho et al. 
[2] gave a lower bound of fi(mn) on the solution running time of the (global) 
LCS problem (see also a survey by Bergroth et al. [16]). Both the standard 
dynamic programming algorithm, and the seaweed algorithm Algorithm 1 
match this lower bound, and therefore are asymptotically optimal in this 
model. 

4.2 The micro-block speedup 

In the previous section, we assumed the character comparison model that 
only allows comparison outcomes "equal" and "unequal" . We now switch to 
a more powerful character comparison model, assuming that the alphabet 
is a totally ordered set, and comparison outcomes are "less than", "equal" 
and "greater than". In this model, we no longer need to process every dag 
cell individually, so algorithms with running time o{mn) become possible 1 . 

A classical LCS computation speedup originates from a matrix multipli- 
cation method by Arlazarov et al. [11], often nicknamed the "four Russians 
method". In this work, we call it the micro-block method, adopting the 
terminology of Bille and G0rtz [20]. The main idea of the method is to 
sweep the alignment dag in regular micro-blocks of a small, suitably cho- 
sen size, such that running time can be saved by precomputing all possible 
micro-block updates in advance. Without loss of generality, we assume that 
m < n. Using the micro-block method, Masek and Paterson [98] gave an 
algorithm for the (global) LCS problem running in time O ( lo ^" n + n) for 

a constant-size alphabet 2 . An alternative approach to subquadratic LCS 
computation was developed by Crochemore et al. [40]. 

An extension of the micro-block subquadratic LCS algorithm to an unbounded- 
size alphabet, running in time Q( " lw (|°g 2 1 °g w ) + n ), was suggested by Pater- 
son and Dancfk [105], and fully developed by Bille and Farach-Colton [19]. 
In this extension, a second, coarser level of alignment dag partitioning is in- 
troduced. The blocks of this second level, called macro-blocks, are used for 
reducing the effective alphabet size, maximising the number of input string 
characters that fit into a machine word for each micro-block update. 

We now give an algorithm for semi-local LCS running in subquadratic 

lr This holds true even if the computation model assumption is weakened, so that char- 
acter comparisons and arithmetic operations are charged using the log-cost RAM model. 
However, for uniformity we will stick to our original assumption of the unit-cost RAM 
model. 

2 The original algorithm by Masek and Paterson [98] runs in time 0(^^ + n) for a 
constant-size alphabet in the log-cost RAM model. The unit-cost RAM version of the 
algorithm was given in [132, 19]. 
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time, which makes a slight improvement on Algorithm 1. The algorithm 
uses the two-level micro-block method, similar to the global LCS algorithm 
of [19], and matches it in running time. At the same time, our algorithm 
provides a substantially more detailed string comparison. However, in con- 
trast to the global LCS algorithms, our algorithm does not appear to be 
able to take advantage of a bounded alphabet size. 

Algorithm 2 (Semi-local LCS: The seaweed algorithm with micro- 
block speedup) 

Input, output: as in Algorithm 1; we assume m < n. 

Description. Without loss of generality, we may assume that the alphabet 
size is at most 2n, and that the characters are encoded by odd half-integers 
in the range (— n : n). We call two strings of equal length isomorphic, if one 
can obtained from the other by a permutation of the alphabet. 
We process the alignment dag in square micro-blocks of size 

1 = min ( i6.Sg» » m ) 

where the logarithms are base 2. Similarly to Algorithm 1, we start with 
a full- match alignment dag Go m ,o n , and transform it by incremental micro- 
block updates into the alignment dag G a ^. We maintain a variable seaweed 
braid on m + n seaweeds, represented by a permutation matrix P over (— m : 
n | : m + n). As in Algorithm 1, a nonzero P a ,b(hj) = 1, * € (— m : n), 
j € (0 : m + n), represents a seaweed starting at position i and terminating 
at position j. We initialise 

P Id m (—m : n | : m + n) 

The algorithm then sweeps the alignment dag G a ^ in some order compatible 
with the <C-dominance order of the micro-blocks. For each micro-block, we 
perform an update on the current seaweed braid, which has to be reflected 
in its representing matrix P. 

Consider a micro-block defined by the substrings a(l : I + t), b(i : i + 1), 
where I £ [0 : m — h], i £ [0 : n — h]. Let i* = i + m — I. The 2t seaweeds 
passing through the current micro-block terminate at positions (i*—t : i*+t). 
A micro-block can be regarded as a function, parameterised by the current 
input substrings, and performing an update on the 2t x 2t column-induced 
permutation submatrix P{- \ i* — t : i* + t). The states of the submatrix 
before and after the updates will be called the micro-block's input submatrix 
and output submatrix, respectively. Note that both of these are permutation 
matrices, and therefore can be represented implicitly by their nonzeros. 

The alignment dag can be swept in an arbitrary order compatible with 
the <C-dominance partial order of the micro-blocks. Combined with pre- 
computation, this is already sufficient to obtain a subquadratic algorithm. 
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However, in order to achieve higher speedup, we introduce a second level of 
macro-blocks of size 

s = min( log 2 - , m) 

We define a macro-block's input and output submatrices similarly to a micro- 
block's ones. 

The characters of a macro-block's defining input substrings are encoded 
by values in the range (— n : n). The macro-block's input and output sub- 
matrices are represented by the row and column indices of the nonzeros; the 
natural range of these indices is also (— n : n). In order to perform the com- 
putation efficiently, we remap each of these ranges to a smaller range (— s : s) 
before passing the values to the micro-block level. The range remapping pre- 
serves the linear order of the values (both characters and matrix indices). 

We process each macro-block as follows. First, we obtain its defining 
substrings and the input submatrix; overall, we have O(s) characters and 
index values for the macro-block. We then remap both these characters 
and index values from (— n : n) to ( — s : s) by removing 2n — 2s unused 
values from the range, while preserving the relative order of the remaining 
2s values. This remapping requires sorting of the O(s) values, and can be 
performed in time O(slogs). We then sweep the current macro-block in 
an arbitrary order compatible with the <C-dominance partial order of the 
micro-blocks. For each micro-block, we obtain its defining input substrings 
and the input submatrix; overall, we have 0(t) values for the micro-block 
of size t. We then apply a precomputed update to a 2t x 2t column-induced 
permutation submatrix of P as follows: 

P{-\i* -t:i* + t) <- 

update (a(l : l + t),b(i : i + t) , P(- \ i* - t : i* +t)) 

The micro-block's defining substrings and the input state consist each of 2t 
values, ranging over (— s : s). For each of the at most {2s) 2t+2t = (2s) 4 * 
possible combinations of the input character and index values, the output 
index values given by the function update are precomputed in advance, using 
Algorithm 1. 

The algorithm maintains the invariant "current state of matrix P is the 
implicit semi-local score matrix for the current state of the alignment dag" . 
Therefore, at the end of the sweep, we have the output matrix 

P = Pa,b 

Cost analysis. In the precomputation stage, there are at most (2s) 4 ' prob- 
lem instances, each of which runs in time 0(t 2 ). Therefore, the running time 
of the precomputation is 

0((2s) 4t • t 2 ) = 0{{\og 2 n)W^ ■ t 2 ) = 0(2 log(log2n) -TO^ • t 2 ) = 
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baabcabcabaca 




which is negligible, compared to the subsequent main computation stage. 

In the main computation stage, there are micro-block update steps. 
The micro-block's defining substrings and input-output submatrices are each 
represented by 2t values in the range (— s : s). The full micro-block data are 
of size 

0(2t • log(2 S )) = 0(j^ • log(log 2 n)) = O(logn) 

and hence fit into a constant number of machine words. Therefore, the total 
running time of the algorithm is 

^•Q(l) = Q( mn( gJ° t f n)a +n) 

Example 17 Figure 4.2 shows a snapshot of Algorithm 2, using the same 
conventions as Figure 4.1. For simplicity, the macro-blocks are not shown, 
and the micro-blocks are assumed to be of size 2. As in Algorithm 1, the 
final layout of the seaweeds is identical to the one given in Figure 3.3. □ 

4.3 Incremental LCS and semi-local LCS 

The incremental LCS problem was introduced by Landau et al. [89], and by 
Kim and Park [80]. Given a fixed string, the problem asks its LCS score 
against a variable string, which can be updated on-line by either appending 
or prepending a character. An extension, called fully-incremental LCS prob- 
lem, was introduced by Ishida et al. [70]. Here, both strings can be updated 
on-line in a similar fashion. In both versions of the problem, the goal is to 
maintain a data structure that will store the LCS score for the input strings, 
and will allow efficient on-line updates of this score. 
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Let a, b denote the current state of each input string, and m, n their 
respective current size. Landau et al. [89] and Kim and Park [80] gave in- 
cremental LCS algorithms with worst-case time 0{m) per update of string 
b. Ishida et al. [70] extended this result to a fully-incremental LCS algo- 
rithm with worst-case time 0(m) (respectively, O(n)) per update of string 
b (respectively, a). 

We now give an algorithm for the fully-incremental LCS problem, match- 
ing the above algorithms in running time. Our algorithm is a straightforward 
generalisation of the seaweed algorithm (Algorithm 1). The dynamic data 
structure consists of the nonzeros of semi-local seaweed matrix P a ,b- When 
a new character is prepended or appended to string a (respectively, 6), the 
seaweed matrix is updated by processing a new row of cells along the top or 
bottom boundary (respectively, a new column of cells along the left or right 
boundary) of the alignment dag G a ^. 

The same technique also extends to the case where the dynamic data 
structure is required to support, in addition to the global LCS score, also 
semi-local LCS queries. A data structure of Theorem 1, allowing efficient 
semi-local LCS score queries, can be maintained on top of the matrix P a ^ 
at an extra time (9((m + n) log(m + n)) per string update. 

We now give another generalisation of incremental string comparison. 
Consider a fixed string of length n, and a variable pattern string, which can 
be modified on-line by either appending or prepending a block of characters 
from a pre-specified set of admissible blocks. The set of admissible blocks 
is known in advance, and off-line preprocessing of this set is allowed. The 
block-incremental LCS problem asks, as before, to maintain a data structure 
that will store the LCS score for the text against the pattern, and will allow 
efficient on-line updates of this score. 

Consider an individual block update, and let I be the corresponding block 
length. Such an update can be done naively as I single-character updates, 
giving the block update time 0{nl). 

We now give an algorithm for the block-incremental LCS problem, that 
improves on the naive algorithm in running time. The set of admissible 
blocks is preprocessed off-line by computing the string-substring seaweed 
matrix for every admissible block against the fixed string. The preprocess- 
ing runs in time O(n-L), where L is the total length of the admissible blocks; 
if the admissible blocks are sufficiently long, the micro-block speedup (Algo- 
rithm 2) can be applied. Given the precomputed string-substring seaweed 
matrices, an individual string update can be processed in time O(nlogZ) by 
Corollary 1. 



53 



4.4 Common-substring LCS and semi-local LCS 



The common- sub string LCS problem was introduced by Landau and Ziv- 
Ukelson [91] (see also [39]). Given a text string t of length n and an unspec- 
ified number of pattern strings, the problem asks for the LCS score of the 
text against each of the patterns. The pattern strings may share a known 
common substring c of length Z; we assume I < n. A pattern string may con- 
tain several copies of the common substring; the locations of all the copies 
are known in advance. The goal is, given the text, to preprocess the common 
substring so as to minimise the LCS computation time for each occurrence 
of the common substring in the patterns. Time 0{n) per pattern character 
is allowed outside any occurrences of c. 

The problem can be solved naively by computing the LCS score for the 
text against each of the patterns, ignoring the common-substring structure. 
The resulting algorithm does no preprocessing, and runs in time 0(nl) per 
occurrence of the common substring. 

An improved algorithm was given by Landau and Ziv-Ukelson [91]. This 
algorithm, following some preprocessing in time 0(nl), runs in time 0(n) 
per occurrence of the common substring. 

We now give an algorithm for the common-substring LCS problem, that 
matches the above algorithm in both preprocessing and running time, but 
has the potential for a micro-block speedup in the preprocessing phase. We 
preprocess the common substring c by computing the nonzeros of the string- 
substring seaweed matrix Pf t . The preprocessing runs in time 0(nl); if the 
common substring length I is sufficiently large, the micro-block speedup 
(Algorithm 2) can be applied. For every pattern string p, we now compute 
the vector h Pt t = Hp t (0 | *) over (0 : n). This vector can be computed 
incrementally by repeated application of Theorem 6. Each vector update 
takes time 0(n) per occurrence of the common substring. 

The common-substring LCS problem can be generalised to the semi-local 
common-substring LCS problem. As in the ordinary semi- local LCS problem, 
string-substring, substring-string, prefix-suffix and suffix-prefix LCS score 
queries are now allowed between the text and each of the patterns. 

This problem can be solved naively by computing the implicit semi- 
local seaweed matrix for the text against each of the patterns, ignoring the 
common-substring structure. The resulting algorithm does no preprocessing, 
and runs in time 0{nl) per occurrence of the common substring. 

We now give an algorithm for the semi-local common-substring LCS 
problem, improving on the naive algorithm in running time. We prepro- 
cess the common substring c in time 0(nl) by computing the nonzeros of 
the semi- local seaweed matrix P c ,t- As before, the preprocessing runs in 
time 0(nl); if / is sufficiently large, the micro-block speedup can be ap- 
plied. For every pattern string p, the semi-local seaweed matrix P Pjt can 
now be computed incrementally, starting from an arbitrary occurrence of c 
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in p. The resulting algorithm can be regarded as a special case of the block- 
incremental LCS algorithm from Section 4.3. Each incremental update takes 
time 0{n log/) per occurrence of the common substring. Overall, the algo- 
rithm takes time O(n) for the first occurrence of the common substring in 
a pattern, and time 0(n log/) for each of its subsequent occurrences in the 
same pattern. In particular, if the common substring occurs only once in 
every pattern string, our algorithm improves on the algorithm of [91, 39] in 
functionality and preprocessing time, without any increase in the asymptotic 
update time. 



4.5 Cyclic LCS 

Given strings a, b of length m, n respectively, the cyclic LCS problem asks 
for the highest LCS score between a and all cyclic shifts of b (or, equivalently, 
all cyclic shifts of a and all cyclic shifts of b). 

Cyclic string comparison has been considered by Maes [95], Bunke and 
Biihler [25], Landau et al. [89], Schmidt [117], Marzal and Barrachina [97]. 
Works [89, 117] give algorithms that solve the cyclic LCS problem in worst- 
case time 0(mn). 

We now give a new algorithm for the cyclic LCS problem, improving 
on the existing algorithms in running time. Without loss of generality, we 
assume m < n. First, we call Algorithm 2 on strings a and bb (a con- 
catenation of string b with itself), obtaining the semi- local seaweed matrix 
P a ,bb- Then, we perform n string-substring LCS score queries for a against 
every substring of bb of length n; this can be done efficiently via Theorem 2. 
Finally, we take the maximum score among all the queries. The overall 
running time is dominated by the call to Algorithm 2, which runs in time 

O / rrm (log log n) 2 \ 
( log 2 n + n )- 



4.6 Longest repeating subsequence 

Given a string a of length n, the longest repeating subsequence problem asks 
for the length of the longest subsequence of a that is a square, i.e. a con- 
catenation of two identical strings. 

This problem has been considered under the name "longest tandem scat- 
tered subsequence problem" by Kosowski [84], who gave an algorithm run- 
ning in time 0(n 2 ). 

We now give a new algorithm for the longest repeating subsequence 
problem, improving on the existing algorithm in running time. First, we 
call the seaweed algorithm with the micro-block speedup (Algorithm 2) on 
string a against itself, obtaining the semi-local seaweed matrix P aA in time 
Q^ n (log log n)' y Then, we perform n — 1 prefix-suffix LCS score queries for 
every possible non-trivial prefix-suffix decomposition of a; this can be done 
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in time O(l) per query by Theorem 2. Finally, we take the maximum score 
among all the queries. The overall running time is Q( - ^^"^ ). 
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Chapter 5 

Weighted string comparison 



5.1 Weighted scores and edit distances 

The concept of LCS score is generalised by that of (weighted) alignment 
score (see e.g. [71]). An alignment of strings a, b is obtained by putting 
a subsequence of a into one-to-one correspondence with a (not necessarily 
identical) subsequence of b, character by character and respecting the index 
order. The corresponding pair of characters, one from a and the other from 
b, are said to be aligned. A character that is not aligned against a character 
of another string is said to be aligned against a gap in that string. Each of 
the resulting character alignments is given a real weight: 

• a pair of aligned matching characters has weight w @ > 0; 

• a pair of aligned mismatching characters has weight w® < w @ ; 

• a gap-character or character-gap pair has weight w e < \w is , it is 
normally assumed that w e < 0. 

The intuition behind the weight inequalities is as follows: aligning a match- 
ing pair of characters is always better than aligning a mismatching pair of 
characters, which in its turn is never worse than leaving both characters 
unaligned. 

Definition 22 The (weighted) alignment score for strings a, b is the maxi- 
mum total weight of character pairs in an alignment of a against b. □ 

Example 18 The LCS alignment score is given by 

W® = 1 7% = W e = 

A slightly more sophisticated alignment score, intended to penalise gaps in 
DNA sequence alignment, is given by 

= 1 w® = w e = —0.5 



57 



Another alignment score used for DNA sequence comparison [33, Section 
1.3] is given by 



= 2 w® = -1 w e = —1.5 □ 

We define the semi-local and the three-way semi-local (weighted) alignment 
score problems by straightforward extension of Definitions 14 and 15. A 
semi-local alignment score corresponds to a highest-scoring path in a weighted 
alignment dag, where diagonal match edges, diagonal mismatch edges, and 
horizontal and vertical edges have weight w @ , w®, w e , respectively. The 
output of the semi-local alignment score problem is a semi-local (weighted) 
score matrix; to distinguish such matrices from (unweighted) LCS score 
matrices, we will use a script font, e.g. T-L a ,b- A semi-local (weighted) score 
matrix is anti-Monge; however, in contrast with the unweighted case, it is 
not necessarily unit-anti-Monge. 

When comparing globally a fixed pair of strings, it is convenient to nor- 
malise the weights so that w 9 = 1, assuming that originally w 9 ^ 0. More 
generally, we only need to consider alignment scores with = w e < w® < 

= 1. (A similar observation was made by Rice et al. [110]; see also 
[64, 74].) Indeed, given any unrestricted original weights w®, i%, w e , we 
can transform them to normalised weights: 

«, e = l «,. = — „ e =0 (5.1) 

We call the corresponding alignment score the normalised score. 

Example 19 In Example 18, the LCS score is already normalised. The 
other two scores correspond to the normalised scores with weights «;* = 1, 
u>* = 0.5 and u>* = 0.4 respectively, and = 0. □ 

The original alignment score h can be restored from the normalised score 
h* as 

h = h* ■ (w® — 2w e ) + (m + n) ■ w e (5.2) 

For fixed string lengths m and n, maximising the normalised score h* is 
equivalent to maximising the original score h. However, more care is needed 
when maximising the score for variable string lengths, e.g. the semi-local 
alignment score. In such cases, an explicit conversion from normalised 
weights to original weights will be necessary prior to the maximisation. 

In this work, we will mostly restrict ourselves to character alignment 
weights that satisfy the following rationality condition. 

Definition 23 A set of character alignment weights will be called rational, 
if all the weights are rational numbers. □ 
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Figure 5.1: Alignment dag G- r for the blown-up strings 



Given a rational set of normalised weights, the semi- local alignment score 
problem on strings a, b can be reduced to the semi-local LCS problem by 
the following blow-up procedure. Let w® = £ < 1, where fx, v are positive 
natural numbers. We transform input strings a, b of lengths m, n into new 
blown-up strings a, b of lengths rh = vm, n = vn. The transformation 
consists in replacing every character 7 in each of the strings by a substring 
$ flr y u ~^ of length v. Here, '$' stands for a special character not present in 
the original strings. We have 

for all i S [— m : n], j G [0 : m + n], where the matrix H a ,b is defined by 
the normalised weights on the original strings a, b, and the matrix H~ r by 

the LCS weights on the blown-up strings a, b. Therefore, all the techniques 
of the previous sections apply to the rational-weighted semi-local alignment 
score problem, assuming that v is a constant. 

Example 20 Figure 5.1 shows the alignment dag for the blown-up strings 
a, b. We assume the normalised alignment weights w 9 = 1, = 0.5, 
w e = 0, hence v = 2. The highlighted path of score 5.5 corresponds to 
the string-substring weighted alignment score for string a against substring 
6(4 : 11) = "cabcaba" . □ 

An important special case of weighted string alignment is the edit dis- 
tance problem. Here, the characters are assumed to match "by default": 
w 9 = 0. The mismatches and gaps are penalised: 2w e < < 0. The 
resulting score is always nonpositive. It is traditional we regard string a as 
being transformed into string b by a sequence of weighted character edits: 

• character insertion or deletion (indel) has weight — w e > 0; 

• character substitution has weight -ro 8 > 0. 
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Definition 24 The (weighted) edit distance between strings a, b is the min- 
imum total weight of a sequence of character edits transforming a into b. 
Hence, it is the nonnegative opposite of the corresponding alignment score. □ 

The edit distance is a metric: it is nonnegative, positive except on equal 
strings, symmetric, and satisfies the triangle inequality. 

Example 21 The indel distance (also called the LCS distance) [104, 10] has 
indel weight 1, and substitution weight 2, making a substitution equivalent 
to an insertion-deletion pair. The corresponding indel alignment score is 
given by 

w 9 = w ls = -2 w e = -l 

The indelsub distance (also called the Levenshtein distance) [92] has both 
indel weight and substitution weight equal to 1. The corresponding indelsub 
alignment score is given by 

w® = w® = w e = -1 □ 

The definition of edit distance can be generalised by allowing insertions 
and deletions to have two separate, distinct weights. For example, the asym- 
metric episode distance [43] corresponds to insertion weight 0, and strictly 
positive deletion and substitution weights. 

In the rest of this work, the edit distance problem will be treated as 
a special case of the weighted alignment problem. In particular, all the 
techniques of the previous sections apply to the semi-local edit distance 
problem, as long as the character edit weights are rational. 

5.2 Approximate pattern matching 

Approximate pattern matching is a natural generalisation of both the or- 
dinary (exact) pattern matching, and of the alignment score and the edit 
distance problems. Given a text string t of length n and a pattern string 
p of length m < n, the approximate pattern matching problem asks to find 
the substrings of the text that are locally closest to the pattern, i.e. that 
have the locally highest alignment score (or, equivalently, lowest edit dis- 
tance) against the pattern. The precise definition of "locally" may vary in 
different versions of the problem, and will typically correspond to a certain 
set of maxima (global, local, row, column etc.) in the string-substring score 
matrix H^ f . 

The most general form of approximate pattern matching is as follows. 

Definition 25 The complete approximate matching problem assumes an 
alignment score with arbitrary weights, and for every suffix of text t, asks 
for a prefix of this suffix that has the highest alignment score against pattern 
p. This corresponds to the set of all row maxima in the matrix Hp t . □ 
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The complete approximate pattern matching problem can be solved by 
a classical algorithm by Sellers [118] in time O(mn). Assuming a rational 
set of weights, the micro-block method gives an algorithm running in time 
O ( lo ™" TO + n) for a constant-size alphabet, and in time O ( mn ^^°^ — \~ n ) 
for an unbounded-size alphabet. Various extensions of the problem have 
been considered by Cormode and Muthukrishnan [38] and many others (see 
e.g. a survey by Navarro [103] and references therein). 

The complete approximate pattern matching problem has output = 
0(n). Most of this output may be redundant, if we are only interested 
in the areas of sufficiently high similarity. This motivates the introduction 
of a similarity threshold. Given a matrix A and a threshold h, it will be 
convenient to denote the subset of entries above the threshold by 

T h (A) = such that A(i,j) > h] 

Definition 26 The threshold approximate matching problem (often called 
simply "approximate matching") assumes an alignment score with arbitrary 
weights, and, given a threshold score h, asks for all substrings of text t that 
have alignment score at least h against pattern p. This corresponds to all 
points in the set Th(Hp t ). □ 

An important special case of the threshold approximate matching prob- 
lem is given by the unweighted LCS score, and the maximum possible thresh- 
old h = m. 

Definition 27 The local subsequence recognition problem (also known as 
the "episode matching problem") asks for all substrings in text t contain- 
ing pattern p as a subsequence. This corresponds to all points in the set 
T m (Hp t ). If this set is nonempty (i.e. if p is contained in the whole t as a 
subsequence), it is the set of all global maxima in the matrix H^ t . □ 

The substrings asked for by Definitions 26-27 will be called matching sub- 
strings. 

Although we have now concentrated our search on the areas of high 
similarity, the output of these approximate matching problems may still be 
highly redundant. In particular, the output of the threshold approximate 
matching is likely to contain highly overlapping matching substrings, where 
the starting and/or ending positions only differ by a few characters. In 
the more extreme case of local subsequence recognition, any superstring 
of a matching substring will also be matching. The usual convention for 
eliminating such redundancy is to filter the output, by retaining only a subset 
of the matching substrings. The full output can be efficiently reconstructed, 
if necessary, from the filtered one. For instance, the filtered output may 
retain: 
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• unique starting (or, symmetrically, ending) positions of matching sub- 
strings; 

• inclusion-minimal matching substrings; 

• matching substrings of a fixed length w. 

In particular, the minimal-window subsequence recognition problem asks for 
all inclusion-minimal substrings in the text containing the pattern as a sub- 
string. This corresponds to all ^-minimal points in the set T m (H^ t ). The 
fixed-window subsequence recognition problem, given a window length w, 
asks for all substrings of length w of the text containing the pattern as a 
substring. This corresponds to all points lying on the intersection of the 
diagonal j — i = w with the set T m (H^ t ). 

In the results we cite, the filtering is usually left implicit, and can be 
assumed to be of one of the types introduced above. 

The local subsequence recognition problem has been considered by Das 
et al. [43]. For both the minimal- window and fixed- window versions, they 
give an algorithm running in time O ( lo ™" ra + n) for a constant-size alphabet, 

which can be modified to an algorithm running in time Qf mn (^°^°s n ) — 1_ n \ 

° ° \ log z n > 

for an unbounded-size alphabet. A multi-pattern version of the problems 
has been considered by Cegielski et al. [29] . 

The threshold approximate pattern matching problem has also been con- 
sidered under the indelsub alignment score, given by alignment weights 
u> ffi = 0, w® = w e = —1. This form of approximate string matching, 
with the threshold score h < 0, is usually defined in terms of the corre- 
sponding threshold edit distance k = —h > 0, and is traditionally known 
as approximate matching with k differences. When the threshold k is low, 
the best known algorithm is by Cole and Hariharan [37], running in time 
0[m + n + ^-). For higher values of k, the best known algorithm is by 
Landau and Vishkin [90], running in time 0(nk). 

We now give a new unified algorithm for approximate pattern matching, 
applicable to any of the problem's versions described by Definitions 25-27, in 
the case of rational weights. Our algorithm matches the micro-block version 
of Sellers' algorithm in running time for an unbounded-size alphabet. The 
algorithm's running time does not include the distance threshold k as a 
parameter. 

The new algorithm is as follows. First, we call Algorithm 2 on strings p, t 
(if necessary, using the blow-up technique of Section 5.1), obtaining the semi- 
local seaweed matrix P p j. By Theorem 1, we then build a data structure 
that allows to query any element of the semi-local score matrix H p j in 
polylogarithmic time. Since matrix H p j is anti-Monge, all the row maxima 
can now be found efficiently by the algorithm of Lemma 1, which solves 
the complete approximate matching problem. The threshold approximate 
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matching problem can now be solved by considering the matrix entries in the 
neighbourhood of the row maxima. Both the minimal-window and the fixed- 
window versions of the local subsequence recognition problem can be solved 
by selecting the row maxima that satisfy the additional filtering conditions. 

In all the described versions of the algorithm, the overall running time is 
dominated by the call to Algorithm 2, which runs in time 0^ " m (|°g 2 1 °g") — hn) . 
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Chapter 6 

Periodic string comparison 



6.1 The periodic seaweed algorithm 

In many string comparison applications, one or both of the input strings 
may have periodic structure. In this chapter, we show how to exploit such 
structure efficiently, using a variant of the seaweed method. 

Consider the problem of comparing a finite string a of length m against 
a string b, which is infinite in both directions and periodic: b = u ±OD = 
. . . uuuu . . . The period string u is finite of length p. 

Definition 28 Given strings a, u, the periodic string-substring LCS prob- 
lem asks for the LCS score of a against every finite substring of b = u ±OQ .n 

Without loss of generality, we assume that every character of a occurs in 
u at least once. Clearly, the length of the substring of b in Definition 28 can 
be restricted to be at most mp (for longer substrings of b, every character of 
a can be matched to a different copy of u within the substring, and therefore 
the string-substring LCS score will be equal to m). 

The definition of the alignment dag (Definition 16) extends naturally 
to the periodic string-substring LCS problem. The alignment dag for such 
a problem is itself periodic: the edges v n _i +kp — > v li+ i +kp (respectively, 

«i-i,i+fcp -> v i+y+k P > Wi_i,i-i+*p v i+i,i+i + k P ) have ef i ual scores for a11 

l G [lo : h], I G (lo : h), i G [io,h], i G (io : i\), k G [— oo : +oo]. Such 
an alignment dag can also be regarded as a horizontal composition of an 
infinite sequence of period subdags, each of which is isomorphic to the mx p 
alignment dag G a ,u- 

Since string b is infinite, the semi-local score and seaweed matrices can be 
understood as just the respective string-substring matrices: matrix H a b = 
H^ b is over [— oo : +oo], and matrix P a ^ = P° fe is over (— oo : +oo). Fur- 
thermore, matrices H a ^, P a ,b are again periodic: we have 

H a ,b(i,j) = H a ,b{i + P, j + P) 
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Pa,b(hj) = Pa,b{i + P,J + P) 



for all i,j £ [—00 : 00], (—00 : 00). To represent such matrices, it is suf- 
ficient to store the p nonzeros of the row-period submatrix P a ,b{0 ■ p | •), or, 
symmetrically, of the column-period submatrix P a ,b{' I : p). The nonzero 
sets of the two period submatrices can be obtained from one another in 
time 0(p). When working with an infinite periodic seaweed matrix, we will 
assume such a representation implicitly. For example, accessing a column 
Pa,b(*, i) will correspond to accessing all columns P a ,b(*, with 1 = i' mod p. 

For the periodic semi-local LCS problem, the seaweeds only need to be 
traced within a single period subdag, with appropriate wraparound. There- 
fore, the problem can be solved by the following variant of the seaweed 
algorithm (Algorithm 1). 

Algorithm 3 (Periodic string-substring LCS: The periodic seaweed 
algorithm) 

Input: strings a, u of length m, p, respectively; we have b = u ±0 ° . 

Output: nonzeros of semi- local seaweed matrix P a ^, represented by nonze- 
ros of (say) column-period submatrix P a ,b{' I : p). 

Description. Similarly to Algorithm 1, we start with a full- match align- 
ment dag Gom oioo, and transform it by incremental cell updates into the 
alignment dag G a ,b- We will maintain an infinite periodic seaweed braid, 
corresponding to the current alignment dag. The current seaweed braid will 
be represented by a variable permutation matrix P over (—00 : +00 | —00 : 
+00), with period p; implicitly, we will assume a column-period submatrix 
representation. A nonzero P a ,b(h3) = 1, j £ (—00 : +00), represents a 
seaweed starting at position i and terminating at position j. 

The initial full-match alignment dag G m ±oo corresponds to an identity 
seaweed braid. Therefore, we initialise 

P <— Id m ( — 00 : +00 j —00 : +00) 

which is clearly a periodic matrix for any period, including p. 

The algorithm then sweeps the period subdag in the following order. 
In the outer loop, we run through the rows of cells top-to-bottom. For 
the current row I £ (0 : m), we start the inner loop at an arbitrary index 
iq G (0 : p), such that a(l) = b(i), hence the corresponding cell in G a j, is a 
match cell. Such an index iq is guaranteed to exist by the assumption that 
every character of a occurs in u at least once. Then, starting from i = iq, 
we sweep the cells left-to-right, wrapping around from i = p — ^ to « = |, 
and continuing the sweep left-to-right up to i = (?o — 1) mod p. 

Consider a cell defined by the characters a(l), b(i) = u(i), I G (0 : m), 
i G (0 : n). Let i* = i + m — I. The two seaweeds passing through the 
current cell terminate at positions {i* — \, i* + ^} = (i* — 1 : i* + 1). As 
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baabcabcabaca 




Figure 6.1: A snapshot of Algorithm 3 (the periodic seaweed algorithm) 

in Algorithm 1, we update a 2 x 2 induced permutation submatrix of P as 
follows: 

if a(l) ^ b(i) and P(- | i* - 1 : i* + 1} = ( J ?) 
then P(. | + <- (§ J) 

Recall that conceptually, this update is performed on an infinite periodic set 
of columns of P. However, in a column-period submatrix representation, it 
is sufficient to update just a single 2x2 submatrix. Note that the first 
update in an inner loop is always trivial: we have a(l) = b(io), therefore P 
remains unchanged. 

At the end of the sweep, we have the output matrix 

P = Pa,b 

Cost analysis. Similarly to Algorithm 1, every cell processing step runs 
in time 0(1). Therefore, the overall running time is 0{mp). 

The memory cost is dominated by storing the input and the period 
submatrix of the current matrix P. Therefore, the overall memory cost is 
0(m + p). ■ 

Example 22 Figure 6.1 shows a snapshot of Algorithm 3, using the same 
conventions as Figure 4.1. The seaweed braid is laid out on a period subdag 
G a ,u', a seaweed leaving the subdag on the right is assumed to wrap around 
and enter the next period subdag at the corresponding point on the left. 
Note that the two seaweeds meeting in the current cell have both wrapped 
around from the previous period subdag, where they did cross. Therefore, 
the current step will leave the seaweed braid unchanged, so that double 
crossing does not occur. □ 

The dag sweeping order in Algorithm 3 is significantly more restricted 
than in Algorithm 1, due to the extra data dependencies caused by the 
wraparound. This seems to rule out the possibility of a micro-block version 
of the algorithm. 
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6.2 Tandem alignment 



The periodic LCS problem has many variations that can be solved by an 
application of the periodic seaweed algorithm. 

The first such variation is the tandem LCS problem. The problem asks 
for the LCS score of a string a of length m against a tandem fc-repeat string 
b = u k of length n = kp. As before, we assume that every character of a 
occurs in u at least once; we may also assume that k < m. 

The tandem LCS problem can be solved naively by considering the LCS 
problem directly on strings a and b, ignoring the periodic structure of string 
b. The standard dynamic programming algorithm (see Section 4.1) solves 
the problem in time 0(mn) = 0{mkp). This running time can be slightly 
improved by the micro-block method (see Section 4.2). 

The tandem LCS problem can also be regarded as a special case of the 
common-substring LCS problem [91, 39] (see also Section 4.4). Using this 
technique, the problem can be solved in time 0(m(k + p)). The techniques 
of Landau et al. [39, 88] give an algorithm for the tandem LCS problem, 
parameterised by the LCS score of the input strings; however, the worst-case 
running time of this algorithm is still 0(m(k+p)). Landau [87] asked if the 
running time for the tandem LCS problem can be improved to 0(m(logk + 

p)Y 

We now give an algorithm that improves on the current algorithms in 
time and functionality, and even exceeds Landau's expectation. First, we 
call Algorithm 3 on strings a and u. Then, we count the number of nonzeros 
^-dominated by point (0,n), i.e. nonzeros in the submatrix P a fi(0 : +oo | 
— oo : n). Given the (say) column-period submatrix P a &(0 : p | •), this 
can be done by a sweep of its p nonzeros, counting every nonzero with 
appropriate multiplicity. More precisely, every nonzero P a ,b{hj) = 1, * € 
(0 : p), j € ( — oo : oo), is counted with multiplicity k — \J/p\ , if j G (0 : n), 
and is skipped (counted with multiplicity 0) otherwise. The solution to the 
tandem LCS problem is then obtained by Theorem 11. The overall running 
time is dominated by the call of Algorithm 3, which runs in time 0(mp). 

Another set of variations on the periodic LCS problem was introduced 
by Benson [14] as the tandem alignment problem. Instead of asking for all 
string-substring LCS scores of a against b = u ±oa , the tandem alignment 
problem asks for a substring of b that is closest to a in terms of alignment 
score (or edit distance), under different restrictions on the substring. In 
particular: 

• the pattern global, text global (PGTG) tandem alignment problem re- 
stricts the substring of b to consist of a whole number of copies of u, 
i.e. to be of the form u k = uu . . .u for an arbitrary integer k; 

• the tandem cyclic alignment problem restricts the substring of b to be 
of length kp for an arbitrary integer k (but it may not consist of a 
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whole number of copies of it); 

• the pattern local, text global (PLTG) tandem alignment problem leaves 
the substring of b unrestricted. 

All three versions of the tandem alignment problem can be regarded as 
special cases of the approximate pattern matching problem (see Section 5.2) 
on strings a of length m and b' = u m of length n = mp (but with the roles 
of the text and the pattern reversed with respect to Benson's terminology). 
Therefore, the tandem LCS problem can be solved naively by considering 
the approximate pattern matching problem directly on strings a and b 1 , 
ignoring the periodic structure of string b' . Given an arbitrary (real) set of 
alignment weights, the classical algorithm by Sellers (see Section 5.2) solves 
the problem in time 0(mn) = 0{m 2 p). For a rational set of weights, the 
running time can be slightly improved by the micro-block precomputation 
method (see Section 5.2). 

The PGTG and PLTG tandem alignment problems can be solved more 
efficiently by the technique of wraparound dynamic programming [101, 53] 
(see also [14]) in time 0{mp). For the tandem cyclic alignment problem, 
Benson [14] modified this technique to give an algorithm running in time 
0(mp log p) and memory 0{mp). 

We now give a new algorithm for the tandem cyclic alignment prob- 
lem, which improves on the existing algorithm in running time, assuming a 
rational set of alignment weights. The running time of the new algorithm 
matches the current algorithms for the PGTG and PLTG tandem alignment 
problems. 

Given input strings a, u, we first solve the periodic string-substring prob- 
lem by calling Algorithm 3. This gives us a period submatrix of matrix P a ,b, 
where b = u ±0 °. Then, for each k, < k < m, we perform independently the 
following procedure. We solve the tandem LCS problem for strings a and 
u k by the method described earlier in this section, counting every nonzero 
in the period submatrix P a ^ with an appropriate multiplicity. This gives us 
the LCS score for a against u k for every k. We then update this score incre- 
mentally, obtaining the LCS score for string a against a window of length p 
in b, sliding through p successive positions. This is equivalent to querying 
p successive elements in a diagonal of matrix P a b, which can be achieved 
by 2p incremental dominance counting queries. By Theorem 2, every one of 
these queries can be performed in time 0(1). 

The call to Algorithm 3 runs in time 0(mp); its output is shared by the 
tandem LCS computation for all k. For each k, the running time of the 
remaining computation is 0(p). Therefore, the combined running time for 
all values of k is m ■ 0(p) = 0(mp). Overall, the algorithm runs in time 
0{mp). 
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Chapter 7 

Permutation string 
comparison 



7.1 Semi- local LCS between permutations 

An important special case of string comparison is where each of the input 
strings a, b is a permutation string, i.e. a string that consists of all distinct 
characters. Without loss of generality, we may assume that m = n, and 
that both strings are permutations of a given totally ordered alphabet of 
size n. The semi-local LCS problem on permutation strings is equivalent to 
the following classical problem. 

Definition 29 Given a string a, the longest increasing subsequence (LIS) 
problem asks for the length of the longest string that is an increasing sub- 
sequence of a. □ 

The LIS problem has a long history, going back to Erdos and Szekeres [50] 
and Robinson [112]. Later, Knuth [82], Fredman [55] and Dijkstra [47] gave 
algorithms running in time O(nlogn). The problem was studied further by 
Chang and Wang [32] and by Bespamyatnikh and Segal [17]. 

The semi-local LCS problem on permutation strings is equivalent to 
solving the LIS problem in every substring of a given string. In the rest 
of this section, we give an efficient algorithm for this problem. 

For consistency with the notation in previous chapters, we will assume 
that a permutation string of length n is indexed by odd half-integers (0 : n), 
and is over the alphabet (0 : n), unless indicated otherwise. The identity 
permutation string of length n is the string id = ^, . . . ,n — 

Given a string a, we denote its reverse string by a. In particular, the 
reverse identity permutation string is id = (n— \,n— |, . . . , ^) . We denote 
by S(a) the set of characters appearing in a at least once. For a set of 
characters S, we denote by a\s the filtered subsequence of a, which consists 
only of those characters that belong to S. 
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Algorithm 4 (Semi-local LCS between permutation strings) 
Input: permutation strings a, b of length n over an alphabet of size n. 
Output: nonzeros of the semi-local seaweed matrix P a ^. 
Description. Recursion on n. 
Recursion base: n = 1. The computation is trivial. 

Recursive step: n > 1. Assume without loss of generality that n is even. We 
partition the input string a into a concatenation a = a' a" of two strings of 
length |. Each of the strings a' , a" is a permutation string of length |. 

The semi-local seaweed matrices P a ',b, Pa",b, are each over ( — | : n | : 
^p), and each contain ^ nonzeros. The semi-local seaweed matrix P a ^ is 
over (— n : n | : 2n), and contains 2n nonzeros. 

Note that for all l G (0 : n), we have -P a ',b(^ i) = L whenever 6(z) g" E(a'). 
More formally, let (0 : n) = I' U I", where 

/' = {? G (0 : n), such that G S(a')} 
I" = [i e (0 : n), such that G S(a')} 

Sets /" can be computed easily from strings a', a", 6, at the cost of sorting 
their character sets. We have a decomposition 

Pa'fi = Pa>,b((- r j : 0) U I' | J' U <n : f >) U | /") 

The two permutation matrices in the above decomposition are of size n and 
^, respectively. Only the n nonzeros in the former matrix are non-trivial; 
they can be obtained by solving recursively the semi-local LCS problem on 
strings a' and 6|s(a') = b(I'), both of which are permutations strings with 
alphabet size ^. Similarly, only n out of nonzeros of P a ",b are non-trivial; 
they can be obtained by solving recursively the semi-local LCS problem on 
strings a" and 6|s( a ") = HI")- 

Finally, given matrices P a >,b, Pa",b, the output matrix P a ^ is computed by 
a call to the algorithm of Corollary 3, which calls the algorithm of Theorem 8 
as a subroutine. Note that we now have two nested recursions: the current 
recursion, and the recursion of Theorem 8. 

(End of recursive step) 

Cost analysis. The recursion tree is a balanced binary tree of height log n. 
In the root node, the running time is dominated by the call to the algorithm 
of Corollary 3, and is therefore 0(n log n). In each subsequent level of the 
recursion tree, the number of nodes doubles, and the running time per node 
is reduced by at least a factor of 2. Therefore, the running time per level is 
0(n log n). The overall running time is logn • 0(n log n) = 0(n log 2 n). ■ 
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dehcbafg dehcbafg dehcbafg 




Figure 7.1: Snapshots of Algorithm 4 (semi- local LCS between permutation 
strings) 

Example 23 Figure 7.1 shows a series of snapshots of an execution of Al- 
gorithm 4 on permutation strings a = "cfaedhgb", b = "dehcbafg". □ 

By keeping the algorithm's intermediate data, we obtain a data structure 
that allows efficient traceback of any semi-local LCS query on a pair of 
permutations, in time proportional to the size of the output (i.e. the length 
of the output subsequence) . A related problem of tracing back LIS in every 
substring of a fixed size in a permutation has been studied by Albert et al. 
[6] and by Chen et al. [35]. In particular, work [35] gives an algorithm that 
runs in time proportional to the size of the output (i.e. the combined lengths 
of all the output subsequences). In the same work, the algorithm is also 
generalised for tracing back the LIS in an arbitrary subset of n substrings, 
possibly of different sizes. In both versions of the problem, the size of the 
output, and therefore the algorithm's running time, can be as high as 0(ra 2 ). 
In contrast, Algorithm 4 allows efficient LIS traceback for any individual 
substring. 

7.2 Cyclic LCS between permutations 

The cyclic LCS problem has been defined in Section 4.5. Given permutation 
strings a, b of length n, the cyclic LCS problem on a, b is equivalent to the 
LIS problem on a circular string. 

This problem has been considered by Albert et al. [5] , who gave a Monte 
Carlo randomised algorithm, running in time 0(n 15 log n) with small error 
probability. 

We now give a new algorithm for cyclic LCS between permutations, 
that improves on the above algorithm both in running time, and by being 
deterministic. First, we call Algorithm 4, obtaining the semi-local seaweed 
matrix P a ,b- Then, we run the algorithm of Corollary 3 on matrix P a ^ 
against itself, obtaining the semi- local seaweed matrix P aa ,b- Finally, we 
perform n substring-string LCS queries for every substring of aa of length 
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n against string b. The overall running time is dominated by the call to 
Algorithm 4, which runs in time 0(n log 2 n). 

A version of the cyclic LCS problem between permutations, parame- 
terised by the output LCS length I, has also been considered by Albert et 
al. [5], who gave an algorithm running in time 0(n/logn). This was im- 
proved upon by Deorowicz [46], who gave an algorithm running in time 
0(min(re/, relogre + £ 3 logn)). Our algorithm described above improves on 
the algorithm of [46], unless I = o((n log n) 1 / 3 ). 

7.3 Longest pattern- avoiding subsequences 

Two given permutation strings a, b of equal length (but generally over dif- 
ferent alphabets) are called isomorphic, if they have the same relative order 
of characters, i.e. a{i) < a(j) iff b(i) < b(j) for all %, j. Given a target per- 
mutation string t of length n and a pattern permutation string p of fixed 
length, the longest p-isomorphic subsequence problem, or simply the longest 
p-subsequence problem, asks for the longest subsequence of t that is isomor- 
phic to p. More generally, given a set of pattern permutation strings X, the 
longest X -subsequence problem asks for the longest subsequence of t that is 
isomorphic to any pattern string in p. 

Example 24 The LIS problem can be interpreted as the longest X-subsequence 
problem, where X is a set of identity permutation strings, one of each length 
m G [1 : n]. □ 

Given a set of antipattern permutation strings Y, the longest Y -avoiding 
subsequence problem asks for the longest subsequence of t that does not 
contain a subsequence isomorphic to any string in Y. 

Example 25 The LIS problem on a permutation string can be interpreted 
as the longest { "21" }-avoiding subsequence problem. □ 

For a detailed introduction into these problems and their connections, 
see the work by Albert et al. [4] and references therein. 

The LIS problem is the only nontrivial example of the longest Y-avoiding 
subsequence problem with antipatterns of length 2. Albert et al. [4] gave the 
full classification of the longest Y-avoiding subsequence problem for all sets 
of antipatterns of length 3. There are 10 non-trivial sets of such antipat- 
terns. For each of these sets, the algorithms given in [4] run in polynomial 
time, ranging from 0(n log n) to 0(n 5 ). Two particular antipattern sets 
considered in [4] are (in that work's original notation): 

C 3 = {"132", "213", "321"} 
C 4 = {"132", "213", "312"} 
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For both these antipattern sets, algorithms given in [4] run in time 0(n 2 log n). 

We now give new algorithms for the longest C3- and ^-avoiding subse- 
quence problems, improving on the above algorithms in running time. 

Permutation strings that are C3-avoiding are all cyclic rotations of an 
increasing permutation string. The longest such subsequence in the target 
string can be found by the algorithm for the cyclic LCS problem between 
permutations (Section 7.2), running in time 0(n log 2 n). 

Permutation strings that are C4-avoiding are all obtained from an in- 
creasing permutation string by reversing some suffix. The longest such sub- 
sequence in the target string can be found as follows. Let the target string 
t be over the alphabet (0 : n). First, we call the standard LIS algorithm on 
t, obtaining explicitly prefix-prefix LCS scores 

ics{t]{i+\),id]{t{i) + \j) = ks(t](i- l),id]{t(i) - D) + 1 

for all 1 G (0 : n). Independently, we call Algorithm 4 on t against the 
reverse identity permutation id, and use Theorem 1 to process its output 
into a data structure that allows efficient queries of all suffix-prefix LCS 
scores lcs(t J k, id [l) for all k, I G [0 : n]. Finally, we obtain the solution to 
the longest Cj-avoiding subsequence problem as 

max (lcs(t](i+ \),id]{t{i) + £)) + lcs(t J (1 + £), id [(t(i) + §))) 

The overall running time is dominated by the call to Algorithm 4, which 
runs in time 0(n log 2 n). 

7.4 Longest piecewise monotone subsequences 

The classical LIS problem asks for the longest increasing (or, equivalently, 
decreasing) subsequence in a permutation string. A natural generalisation 
is to ask for the longest subsequence that consists of a constant number of 
monotone pieces. In particular, given a permutation string a of length n, the 
longest k-increasing subsequence (respectively, longest k-modal subsequence) 
problem asks for the longest subsequence in a that is a concatenation of at 
most k sequences, all of which are increasing (respectively, alternate between 
increasing and decreasing) . In the case of the longest A;- modal subsequence 
problem, we assume without loss of generality that k is even. Both problems 
can be solved as an instance of the LCS problem, comparing the input 
permutation string a against string id k , i.e. the concatenation of k copies of 
the identity permutation id (respectively, against string (idid) k ^ 2 , i.e. the 
concatenation of k alternating copies of id and its reverse id). The resulting 
alignment dag is of size n x kn, and contains kn match cells. Using standard 
sparse LCS algorithms [69, 10], such an instance of the LCS problem can be 
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(a) The chord model 



(b) An interval model 



Figure 7.2: A circle graph and its maximum clique 

solved in time 0(nklogn). Demange et al. [45] gave a similar algorithm for 
the longest k- modal subsequence problem, also running in time 0(nklogn). 

We now give a new algorithm for the longest fc-increasing subsequence 
and the longest k- modal subsequence problems, improving on the above 
algorithms in running time, unless k is very small. 

Our algorithm is as follows. In the case of the longest fe-increasing sub- 
sequence problem, we run Algorithm 4, obtaining the semi-local seaweed 
matrix Pid, a - Then, we extract the string-substring seaweed matrix Pfj a , 
and run on it log k times the algorithm of Corollary 1 , obtaining the string- 
substring seaweed matrix P a , k . In the case of the longest /c-modal subse- 

° id",a ° 

quence problem, we assume without loss of generality that k is even. We 
run Algorithm 4 twice, obtaining the semi-local seaweed matrices Pid, a an d 
Pjd a , from which we obtain the matrix P id j^ by Corollary 3. Then, we 
extract the string-substring seaweed matrix P a 1 — , and run on it log k — 1 

id id, a 

times the algorithm of Corollary 1, obtaining the string-substring seaweed 
matrix a - Finally) for both problems we use the resulting string- 

substring seaweed matrix to query the global LCS score. The described 
algorithm runs in time 0(n log 2 n) + logfc • 0(n log n) = 0(n log 2 n). This 
is an improvement on both the sparse LCS approach and the algorithm of 
[45], unless k = O(logn). 

7.5 Maximum clique in a circle graph 

A circle graph [51, 60] is defined as the intersection graph of a set of chords 
in a circle, i.e. the graph where each node represents a chord, and two nodes 
are adjacent, whenever the corresponding chords intersect. We consider the 
maximum clique problem on a circle graph. 

The interval model of a circle graph is obtained by cutting the circle at 
an arbitrary point and laying it out on a line, so that the chords become 
(closed) intervals. The original circle graph is isomorphic to the overlap 
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graph of its interval model, i.e. the graph where each node represents an 
interval, and two nodes are adjacent, whenever the corresponding intervals 
intersect but do not contain one another. 

Example 26 Figure 7.2 shows an instance of the maximum clique problem 
on a six- node circle graph. Subfigure 7.2a shows the set of chords defining 
a circle graph, with one of the maximum cliques highlighted in bold red. 
The cut point is shown by scissors. Subfigure 7.2b shows the corresponding 
interval model; the dotted diagonal line contains the intervals, each defined 
by the diagonal of a square. The squares corresponding to the maximum 
clique are highlighted in bold red. □ 

It has long been known that the maximum clique problem in a circle 
graph on n nodes is solvable in polynomial time [56]. A number of algorithms 
have been proposed for this problem [114, 67, 99, 9]; the problem has also 
been studied in the context of line arrangements in the hyperbolic plane 
[76, 49]. Given an interval model of a circle graph, the running time of the 
above algorithms is 0{n 2 ) in the worst case, i.e. when the input graph is 
dense. In [122, 124], we gave an algorithm running in time 0(n L5 ). 

We now give a new algorithm for the maximum clique problem in a circle 
graph, improving on existing algorithms in running time. The algorithm is 
based on the fast matrix distance multiplication procedure of Theorem 8. 

Our algorithm takes as input the interval model of a circle graph G on 
n nodes. Without loss of generality, we may assume that the set of interval 
cndpoints is (0 : 2n). The interval model is represented by a permutation 
string a of size 2n, where for each left (respectively, right) interval endpoint 
i £ (0 : 2n), a{i) is the corresponding right (respectively, left) endpoint. 
Note that for all i < j, an interval with left endpoint i does not contain an 
interval with left endpoint j, if and only if a(i) < a{j). Various alternative 
representations of interval models (e.g. the ones used in [114, 9]) can be 
converted to this representation in linear time. 

In the interval model, a clique corresponds to a set of pairwise intersect- 
ing intervals, none of which contains another interval from the set. Recall 
that intervals in the line satisfy the Helly property: if all intervals in a set 
intersect pairwise, then they all intersect at a common point. In our context, 
we only need to consider integer indices as intersection points. 

Consider a clique in G. Let £ [1 : 2n — 1] be a common intersection 
point of the intervals representing the clique, which is guaranteed to exist 
by the Helly property. Since the intervals representing the clique cannot 
contain one another, the sequence of their right endpoints is an increasing 
subsequence of a. Let id be the identity permutation string of length In. 
From the observations above, it follows that the clique corresponds to a 
common subsequence of a prefix a ] k and a suffix id J k. Therefore, the 
maximum clique can be solved as an instance of the semi- local LCS problem. 
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Algorithm 5 (Maximum clique in a circle graph) 

Input: interval model of circle graph G, represented by permutation string 
a of size 2n. 

Output: maximum-size clique of G, represented by the set of (say) left 
endpoints of the corresponding intervals. 

Description. 

First phase. We run Algorithm 4 on the input permutation string a against 
the identity permutation string id, obtaining a semi- local seaweed matrix 
of size 4ra. We then build the data structure of Theorem 1 for querying 
semi- local LCS scores of a against id. 

Second phase. For each k € [1 : 2n — 1], we query the LCS score of prefix 
a ] k against suffix id J k. The maximum of the 2n returned values gives the 
size of the maximum clique in G, and the corresponding value k gives a 
common intersection point of the clique intervals. 

Third phase. The intervals defining the maximum clique can now be ob- 
tained by running a standard LIS algorithm on string (a] fc)|s(idj k)- 

Cost analysis. 

First phase. The running time of Algorithm 4 is 0(rolog 2 n). 

Second phase. By Theorem 1, the combined running time of all the prefix- 
suffix queries is 0(n log 2 n), if the queries are performed independently. This 
time can be reduced to 0(n) by observing that the queries can be performed 
as a single diagonal batch query. 

Third phase. The LIS algorithm runs in time O(nlogn). 

Total. The overall running time is 0(n log 2 n). ■ 

Like many algorithmic problems, the problem of finding a maximum 
clique in a circle graph admits various parameterised versions. Some relevant 
parameters are: 

• the size I of the maximum clique; 

• the thickness d of the interval model, i.e. the maximum number of 
intervals containing a point, taken across all points in the line; 

• the number e of graph edges. 

For any interval model of a non-trivial circle graph, we have I < d < n < 
e < n 2 . Notice that, given a permutation representing an interval model, its 
thickness can be found in time 0(n log 2 n) by building a range tree on the 
corresponding set of planar points, and then performing 0(n) dominance 
counting queries. 
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Apostolico et al. [9] give algorithms for the parameterised version of the 
maximum clique problem in a circle graph, running in time 0(n log n+e) and 
0(n log n + nl\og(n/l)Y They also describe an algorithm for the maximum 
independent set problem, parameterised by the interval model's thickness. 

We now give a new algorithm for the maximum clique problem in a circle 
graph, parameterised by the thickness of the input interval model. Our 
algorithm improves on the parameterised algorithms of [9] for most values 
of the parameters. The algorithm is an extended version of Algorithm 5. 

Algorithm 6 (Maximum clique in a circle graph, parameterised by 
thickness) 

Input: interval model of circle graph G, represented by string a of size 2n. 

Output: maximum-size clique of G, represented by the set of (say) left 
endpoints of the corresponding intervals. 

Parameter: thickness d, d < n, of the input interval model. 

Description. 

First phase. We run Algorithm 4 on string a] (r + 1) against string id J rd, 
independently for all r G [0 : ^ — 1]. As will be shown in the algorithm's 
analysis, in each run we obtain an semi-local seaweed matrix with at most 
4d non-trivial nonzeros. For every r, we then build the data structure of 
Theorem 1 for querying semi-local LCS scores of a] (r + 1) against id \ rd. 

Second phase. For each k G [1 : 2n — 1], we query the LCS score of pre- 
fix a] k = (a] (r + l)d) ) k against suffix id J k = (id J rd) \ (k — rd), where 
r = [k/d\. The maximum of the 2n returned values gives the size of the 
maximum clique in G, and the corresponding value k gives a common inter- 
section point of the clique intervals. 

Third phase. The intervals defining the maximum clique can now be ob- 
tained by running a standard LIS algorithm on string (a] fc) | J k)- 

Cost analysis. 

First phase. We have a ] (r + l)d = (a] rd) ((a J rd) ] d) . The alignment dag 
of a] rd against id \ rd contains at most d match cells, since every match 
corresponds to an interval containing point rd, and there can be at most d 
such intervals by the definition of thickness. The alignment dag of (a J rd) ] d 
against id J rd also contains at most d match cells, since the length of the 
string (a J rd)] d is d. The alignment dag of a](r + l)d against id \ rd is 
the composition of the above two alignment dags, and therefore contains at 
most d + d = 2d matches. Therefore, the time for each run of Algorithm 4 is 
0(d log 2 d), and the overall running time of this phase is 0(n/d- dlog 2 d) = 
0(n log 2 d). 

Second phase. By Theorem 1, the combined running time of all the prefix- 
suffix queries is 0(n log 2 d), if the queries are performed independently. This 
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time can be reduced to 0{n/d- d) = 0(n) by observing that the queries can 
be performed as a single diagonal batch query. 

Third phase. The alignment dag of a ] k against id J k contains at most d 
matches, since every such match corresponds to an interval containing point 
k. Therefore, string (a] k) nas length at most d. The LIS algorithm 

runs in time 0(d log d). 

Total. The resulting overall running time is 0(nlog 2 d). ■ 

Algorithm 6 improves on the (9(nlogn + e) algorithm of [9], unless e = 
o(n log 2 d) = 0(n log 2 n). It also improves on the 0(n log n + n/log(n//)) 

algorithm of [9], unless I = o( ' 1 ° g gn ti ) = O(logn). 

7.6 Maximum common pattern between linear graphs 

The concept of a linear graph, introduced by Davydov and Batzoglou [44], 
is similar to an interval model of a circle graph defined in Section 7.5. The 
interval relations of disjointness, containment and overlapping are denoted 
respectively by symbols <, C and A pattern in a linear graph is defined 
as an ordered subset of intervals, all of which satisfy pairwise a prescribed 
subset of relations. 

Fertin et al. [52] considered the maximum common S-structured pattern 
(S-MCSP) problem. The problem asks for the maximum common pattern 
in a set of n linear graphs, each defined by at most m intervals, where the 
structure of the common pattern is restricted by a prescribed subset of re- 
lations S C {<,□,$}. In particular, the {$}-MCSP problem asks for the 
maximum commonly-structured subset of pairwise overlapping intervals; for 
n = 1 this is equivalent to finding the maximum clique of a circle graph, 
and for general n is equivalent to finding the minimum-sized clique among 
maximum cliques of the n input circle graphs. The {<, C}-MCSP prob- 
lem asks for the maximum commonly-structured subset of intervals, no two 
of which are overlapping; for n = 1 this is equivalent to finding the maxi- 
mum independent set of a circle graph; however, for general n the maximum 
commonly-structured independent set of the n input circle graphs may be 
significantly different from (and smaller than) each of the n individual maxi- 
mum independent sets. The {<, C, $}-MCSP problem asks for the maximum 
commonly-structured subset of intervals without any a priori restriction on 
its structure. 

Extending and generalising a number of previous results, paper [52] con- 
siders the S-MCSP problem, where S runs over all seven nonempty subsets 
of {<, C, §}. For some of these seven variants, the algorithms use as a sub- 
routine the algorithm of [122, 124] for the maximum clique problem in a 
circle graph. By plugging in the more efficient Algorithm 5, we can ob- 
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tain improved algorithms for those variants of the S-MCSP problem, where 
finding the maximum clique in a circle graph is a bottleneck. 

In particular, the {$}-MCSP problem is solved in [52] by finding the 
maximum clique independently for n circle graphs, each corresponding to 
one of the input linear graphs, in overall time 0(nm L5 ). By plugging in 
Algorithm 5, the running time is improved to 0{nm log 2 m). 

The {<, $}-MCSP problem is shown in [52] to be NP-hard, and to ad- 
mit a polynomial-time 2/i(fe)-approximation, where h(k) = X)i<i<fcV* = 
Inn + 0(1); for the rest of this section, k denotes the size of the opfimal so- 
lution to the problem. The approximation is obtained by 0(nm) calls to the 
following subroutine: given a circle graph of size m, and two integers mi, mi, 
decide whether the graph contains mi disjoint cliques, each of size m,2- This 
subroutine is performed in time 0(m 2 ' 5 logm), and therefore the overall run- 
ning time is 0{nm) ■ 0(m 2 ' 5 log m) = 0(nm 3,5 log m). By a straightforward 
extension of Algorithm 5, the running time of the subroutine is improved 
to 0{m log 2 m), and therefore the overal running time of the approximation 
algorithm is improved to 0(nm) ■ 0(m log 2 m) = 0(nm 2 log 2 m). 

The $}-MCSP problem is also shown in [52] to be NP-hard, and to 
admit a polynomial-time /c 1//2 -approximation. The approximation is ob- 
tained by combining exact solutions for the {c}-MCSP and {$}-MCSP 
problems on the same input sets. The exact solution for the {$}-MCSP 
is the bottleneck; by plugging in the improved algorithm for this problem 
described above, the running time of the approximation algorithm for the 
{C,$}-MCSP problem is improved from 0(nm 15 ) to 0(nm log 2 m). 

Finally, paper [52] argues that the {<, C, $}-MCSP problem is NP-hard, 
and gives several polynomial-time approximation algorithms. In particular, 
it gives an 0(fc 2 / 3 )-approximation algorithm running in time 0(nm 15 ), and 
an 0[(k log fc) 1 / 2 ) -approximation algorithm running in time 0(nm 3 5 log m). 
By using the techniques described above, the running times of these approxi- 
mation algorithms are improved respectively to 0(nm log 2 m) and 0(nm 2 log 2 
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Chapter 8 

Compressed string 
comparison 



8.1 Grammar-compressed strings 

String compression is a classical area of computer science. It is natural to 
ask whether compressed strings can be processed efficiently without decom- 
pression. Early examples of such algorithms were given e.g. by Amir et al. 
[8] and by Rytter [115]. 

We consider the following general model of compression. 

Definition 30 Let f be a string of length n (typically large). String t will 
be called a grammar-compressed string (GC-string), if it is generated by a 
context-free grammar, also called a straight-line program (SLP). An SLP of 
length n, n < n, is a sequence of n statements. A statement numbered k, 
1 < k < n, has one of the following forms: 

tk = a where a is an alphabet character 

tk = Utj where 1 < i, j < k □ 

We identify every symbol t r with the string it represents; in particular, we 
have t = tfi- In general, the plain string length n can be exponential in the 
GC-string length ft. 

Example 27 The Fibonacci string "abaababaabaab" of length 13 can be 
represented by the following SLP of length 7: 

t\ = 'b' t2 = 'a' 

t3 = htl ti = t^t2 t$ = t^ te = t^t^ t-j = ^6^5 

In general, a Fibonacci string of length n can be represented by an SLP of 
length n, where n = Fn = — o(l)) ( 1+ 2 )" ^ s ^ ne ^~th Fibonacci number. 
This example is borrowed from Hermelin et al. [65]. □ 
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Kida et al. [79] introduced a more general compression model, called col- 
lage systems. Grammar compression is a equivalent to a subclass of collage 
systems called regular. As a special case, grammar compression includes the 
classical LZ78 and LZW compression schemes by Ziv, Lempel and Welch 
[134, 131]. Both these schemes can be expressed by an SLP that consists of 
three sections: 

• in the first section, all statements are of the form tk = a; 

• in the second section, all statements are of the form tk = Utj, where 
statement j is from the first section; 

• in the third section, all statements are of the form tk = tk-itj, where 
statement j is from the second section. 

It should also be noted that certain other compression methods, such as e.g. 
LZ77 [133] and run-length compression, do not fit directly into the grammar 
compression model. 

The algorithms in this section will take as input a grammar-compressed 
text string t of length n, generated by an SLP of length n, and a plain 
pattern string p of length m. We aim at algorithms with running time that 
is a low-degree polynomial in m, n, but is independent of n (which could be 
exponential in n). 

8.2 Global subsequence recognition 

The global subsequence recognition problem has been defined in Section 3.1. 
We recall that the problem asks whether the text t contains the pattern p 
as a subsequence. It is a simple special case of the LCS problem. 

In this section, we revisit the global subsequence recognition problem, 
now assuming a GC-text t and a plain pattern p as inputs. In this setting, 
the problem can be solved by the following simple folklore algorithm. 

Algorithm 7 (Global subsequence recognition) 

Input: SLP of length n, generating text string t of length n; plain pattern 
string p of length m. 

Output: an integer k, giving the length of the longest prefix of p that is a 
subsequence of t. String t contains p as a subsequence, if and only if k = m. 

Description. Recursion on the input SLP generating t. 

Recursion base: n = n = 1. The output value k € {0, 1} is determined by a 
single character comparison. 

Recursive step: n > n > 1. Let t = t't" be the SLP statement defining string 
t. Let k' be the length of the longest prefix of p that is a subsequence of t'. 
Let k" be the length of the longest prefix of p \ k' that is a subsequence of 
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t". We call the algorithm recursively to obtain kl and k", and then return 
k = k' + k". 

(End of recursive step) 

Cost analysis. The running time of the algorithm is 0{kn). The proof is 
by induction. The running time of the recursive calls is respectively O(k'n) 
and 0{k"n). The overall running time of the algorithm is 0(k'n) + 0{k"n) + 
O(l) = 0(kn). In the worst case, this is 0(mn). ■ 

8.3 Three-way semi-local LCS 

We recall from Chapter 4 that the LCS problem on a pair of plain strings 
can be solved in time O ( lo ^" n ) , assuming that m < n and that m and n are 
reasonably close, by the micro-block method of Masek and Paterson [98]. 
The LCS problem on a pair of GC-strings has been considered by Lifshits 
and Lohrey [94], and proven to be NP-hard. 

In this section, we revisit the LCS problem, now assuming a GC-text 
t and a plain pattern p as inputs. Recall that we aim at algorithms with 
running time independent of n (which could be exponential in n). This 
rules out any attempt at solving the full semi-local LCS problem, since 
the resulting semi- local seaweed matrix would require memory 0(m + n). 
However, we are still able to consider the three-way semi-local LCS problem, 
excluding the computation of LCS on substrings of t. 

In the special case of LZ77 or LZW compression of the text, the algorithm 
of Crochemore et al. [40] solves the LCS problem in time 0(mn); in other 
words, LZ77 or LZW compression of one of the input strings only slows 
down the LCS computation by a polylogarithmic factor. 

The general case of an arbitrary GC-text appears more difficult. A GC- 
text is a special case of a context-free language, which consists of a single 
string. Therefore, the LCS problem between a GC-text and a plain pattern 
can be regarded as a special case of the edit distance problem between a 
context-free language given by a grammar of size n, and a pattern string 
of size m. For this more general problem, Myers [102] gave an algorithm 
running in time 0{m^n + m 2 -n log n). In [125], we gave an algorithm for the 
three-way semi-local LCS problem between a GC-text and a plain pattern, 
running in time 0(m L5 n). Lifshits [93] asked whether the LCS problem in 
the same setting can be solved in time 0(mn). 

A new algorithm for the three-way semi-local LCS problem can be ob- 
tained by an application of the techniques described in Chapter 3. The 
resulting algorithm improves on existing algorithms in running time, and 
approaches an answer to Lifshits' question within a logarithmic factor. 
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Algorithm 8 (Three-way semi-local LCS) 

Input: SLP of length n, generating text t of length n; plain pattern p of 
length m. 

Output: nonzeros of matrices P^ t , P° t , P° t . 

Description. First, we observe that, although the output matrices contain 
at most m nonzeros, the index range of these nonzeros is of size m + n, 
which may be exponentially larger. To avoid an exponential growth of the 
indices, we will clean up the range by removing unused indices, and deleting 
the corresponding zero row-column pairs from the matrices. Formally, we 
describe this process as an order-preserving remapping of the index range. 

First phase. Recursion on the input SLP generating t. 

Recursion base: n = n = 1. The output can be computed by a linear sweep 
of string p. 

Recursive step: n > n > 1. Let t = t't" be the SLP statement defining string 
t. We call the algorithm recursively to obtain the nonzeros of matrices 

pa pE pn pa pE pa 

r p,t' r P ,t' r P ,t' r P ,t" r p,t" r P ,t" 

The total number of nonzeros in each matrix triple is between m and 2m. 
Conceptually, these matrices are submatrices of P p ^> over (— m : n' | : 
m + n'), and P Pjt >i over (— m : n" \ : m + n"). However, the actual 
remapped index range after the recursive calls is (— m : 2m \ : 3m) for 
both matrix triples. We now compute the composition seaweed matrices 

pa de pa 

p,t r p,t r p,t 

by Corollary 2. The total number of nonzeros in this matrix triple is again 
between m and 2m. Conceptually, these matrices are submatrices of P p j 
over (— m : n \ : m + n). However, the actual remapped index range after 
the composition is (— m : 4m | : 5m,). Therefore, there are at least 2m 
indices i G (0 : 4m), such that the row P^(i, *) and the column P H (*, i) both 
contain only zeros. We now delete exactly 2m such rows and columns from 
the respective matrices, and remap the index range to (— m : 2m | : 3m), 
while preserving the linear order of the indices. 

(End of recursive step) 

Second phase. We now have the nonzeros of the output matrices, remapped 
to the index range (— m : 2m | : 3m). This is already sufficient to query the 
global LCS score, or substring-string LCS scores for p against t. However, if 
explicit indices of the nonzeros in the output seaweed matrices are required, 
the index range can be remapped back to (—m:n\ : m + n) by reversing 
every remapping step in the recursion. 
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Cost analysis. 

First phase. By Corollary 2, each recursive step runs in time 0(m log m). 
There are n recursive steps in total, therefore the first phase runs in time 
0(mlogm ■ n). 

Second phase. For each nonzero, the inverse remapping can be performed 
recursively in time 0{n). There are m nonzeros in total, therefore the second 
phase runs in time 0{mn). 

Total. The overall running time is 0(m log m ■ n). ■ 

Algorithm 8 provides, as a special case, an algorithm for the LCS problem 
between a GC-string and a plain string, running in time 0(m log m ■ n); 
the LCS score can easily be queried from any one of the algorithm's three 
output matrices by Theorem 11. This running time should be contrasted 
with standard LCS algorithms on plain strings, running in time O ( \ ^^ +n ^ ) 
[98, 40], and with the NP-hardness of the LCS problem on two GC-strings 
[94]. 

Hermelin et al. [65] gave a more detailed analysis of the LCS problem 
on GC-strings, by considering the weighted alignment problem on a pair of 
GC-strings a, b of total compressed length f = fh + n, parameterised by the 
strings' total plain length r = m + n. They gave an algorithm running in 
time 0(r 134 f 1-34 ) for general weights, and in time C^r^f 1-4 ) for rational 
weights. 

In the case of rational weights, the parameterised running time of weighted 
GC-string alignment can be improved by the following straightforward algo- 
rithm. First, we uncompress one of the input strings — say, string b. Then, 
we run Algorithm 8 on the GC-string a as a text against the plain string b 
as a pattern. The resulting running time is 0(m log m ■ n) = 0{r log r • f). 



8.4 Local subsequence recognition 

The local subsequence recognition problem was introduced in Section 3.1 as 
a special case of the semi-local LCS problem. Definition 27 in Section 5.2 
described this problem as a variant of the approximate matching problem. In 
the context of local subsequence recognition, a substring of text t is called a 
matching substring, if it contains the pattern p as a subsequence. A matching 
substring will be called minimally matching, if it is inclusion-minimal, i.e. it 
has no proper matching substring. 

We recall that, depending on the output filtering, local subsequence 
recognition can take the following forms: the minimum-window subsequence 
recognition problem, which asks for the locations of all substrings of t that 
are minimally matching, and the fixed-window subsequence recognition prob- 
lem, which asks for the locations of all the matching substrings of a fixed 
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length w. A combination of these two problems is the bounded minimal- 
window subsequence recognition problem, which asks for the locations of all 
the minimally matching substrings below a fixed length w. 

Clearly, the output size for the described reporting versions of these 
problems may be exponential in n; therefore, we have to parameterise the 
running time by the output size, which we denote by output. An algorithm 
for the reporting version of any of the above problems can typically be 
converted to solve the corresponding counting version of the problem. Such 
a counting algorithm, instead of reporting all the matching substrings, only 
returns their overall number. The running time of the counting versions for 
all algorithms described in this section will correspond to the running time 
of the reporting algorithm with output = 0(1). 

The minimal-window, fixed-window and bounded minimal-window sub- 
sequence recognition problems for a GC-text against a plain pattern have 
been considered by Cegielski et al. [28]. For each problem, they gave an algo- 
rithm running in time 0(m 2 \ogm-n+ output) . In [125], we gave an algorithm 
improving the running time for these problems to 0(m L5 n + output). 

We now give a more efficient local subsequence recognition algorithm, 
based on Algorithm 8, which we extend as follows. In addition to the sea- 
weed matrices P? f , P?,, we now also make use of the seaweed cross- 
matrix P^ t , t „ . We extend every recursive step by the reporting of minimally 
matching substrings that are cross-substrings in the current seaweed matrix 
composition. 

Algorithm 9 (Local subsequence recognition) 

Input: SLP of length n, generating text t of length n; plain pattern p of 
length m. 

Output: locations (or count) of minimally matching substrings in t. 

Description. Similarly to Algorithm 8, index remapping has to be per- 
formed in the background in order to avoid an exponential growth of the 
indices. To simplify the exposition, we now assume constant-time index 
arithmetic, keeping the index remapping implicit. 

First phase. Recursion on the input SLP generating t. 

Recursion base: n = n = 1. As in Algorithm 8, the seaweed matrices Pp t , 
P^ t , Ppj can be computed by a linear sweep of string p. String t is matching, 
if and only if m = 1 and t = p; in this case, t is also minimally matching. 

Recursive step: n > n > 1. Let t = t't" be the SLP statement defining string 
t. We run a recursive step of Algorithm 8, obtaining the seaweed matrices 

pa pn pn 

p,t p,t r v,t 
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In addition, we obtain the seaweed cross-matrix P®., .„ by Corollary 2. This 
matrix has exactly m nonzeros. Let 

C = {(^,^)<<(is,js)<<...<<(i s _ h ,j s _ h )} 

be the <C-chain of all ^-maximal nonzeros in P^ t , t „, where s = \C\ < m. 

By Theorem 11, a substring t(i : j) is matching, if and only if point 
is not ^-dominated by any nonzeros in the seaweed matrix P Ptt . Recall that 
a substring t(i : j) is a cross-substring, if i £ [0 : n' — 1], j £ [n' + 1 : n]; 
in other words, a cross-substring consists of a non-empty suffix of i' and a 
non-empty prefix of t" . A point corresponding to a cross-substring can 
only be ^-dominated by nonzeros within the seaweed cross-matrix P^ t , t „. 
Therefore, a cross-substring t(i : j) is matching, if and only if point is 
not ^-dominated by any of the nonzeros in P^ t , t „, or, equivalently, by any 
point in C. 

Consider the set of all points in [— m : n' \ n' : m + n], not ^-dominated 
by any point in C The ^-minimal points in this set, excluding the irrelevant 
boundary points , and (n', [j s _i]), are interleaved with the points 
of C, and form themselves a ^-chain of size s — 1: 

m = {(L«§J. Till) « (1*§J> fi§l) « ••• « (K-iJ Ji s - f l)} 

Let i £ [0 : n' — 1], j € [n' + 1 : n]. Then, a cross-substring t(i : j) is 
minimally matching, if and only if 6 .M. The number of such points 
(i, j) is at most \M\ = m — 1. 

(End of recursive step) 

Second phase. For every SLP symbol, we now have the locations of its min- 
imally matching cross-substrings. Furthermore, every non-trivial substring 
of t corresponds to a cross-substring for some SLP symbol, under an appro- 
priate transformation of indices. By another recursion on the structure of 
the SLP, it is now straightforward to obtain either the locations or the count 
of all the minimally matching substrings in t. 

Cost analysis. 

First phase. As in Algorithm 8, each seaweed matrix composition runs in 
time 0(m log m). The <C-chains C and M can be obtained in time 0(m). 
Hence, the running time of a recursive step is 0{m log m) . There are n recur- 
sive steps in total, therefore the whole recursion runs in time 0(m log m-n). 

Second phase. For every SLP symbol, there are at most m — 1 minimally 
matching cross-substrings. Given the output of the first phase, the locations 
of all minimally matching substrings in t can be reported in time 0(mn + 
output). 

Total. The overall running time is 0(m log m • n + output). ■ 
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Example 28 Figure 8.1 shows a snapshot of a recursive step in the first 
phase of Algorithm 9. Subfigure 8.1a shows the seaweed cross-matrix P^ t , t „; 
in this particular example, all its nonzeros belong to the string-substring sea- 
weed matrix P°, ,„. Subfigure 8.1b shows the corresponding seaweed braid. 
Matrix P^ t , t „ contains m = 5 nonzeros, shown by green bullets in Subfig- 
ure 8.1a, and by green seaweeds in Subfigure 8.1b. Out of these five nonze- 
ros, three are ^-maximal; they are shown by larger bullets (respectively, 
by thicker seaweeds). The three ^-maximal nonzeros form the <C-chain C 
Consequently, there are 3 — 1 = 2 points in the interleaved <C-chain Ai. In 
Subfigure 8.1a, these two points are shown by asterisks; in Subfigure 8.1b, 
the corresponding two substrings of t are shown by dotted brackets. The 
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interleaving between <C-chains C and Ai is shown in Subfigure 8.1a by solid 
black lines. Both points of M lie within the range [0 : n' — 1 | n' + 1 : n], and 
therefore each of them corresponds to a minimally matching cross-substring 
in t. 

By Theorem 11, a substring in t is matching, if and only if the corre- 
sponding rectangle in the alignment dag is not pierced by a seaweed en- 
tering at its left-hand boundary and leaving at its right-hand boundary. 
Notice that the bracketed substrings of t in Figure 8.1b are exactly the two 
inclusion-minimal cross-substrings satisfying this property. □ 

An algorithm for the fixed- window subsequence recognition problem can 
be obtained from Algorithm 9 as follows. Substrings t(i : j) of length w 
correspond to points lying on the diagonal j — i = w in the semi- 

local score matrix H P) t- Consider the set of all points on this diagonal, 
^-dominated by any point in the <C-chain C, introduced in Algorithm 9. 
This set consists of a (not necessarily disjoint) union of s diagonal intervals 

hi = \ (i, i + w) such that i £ [|\] : [juj — w] \ 



where any interval of negative length is by convention considered empty. In 
every recursive step, the interval endpoints in the set hi can be computed in 
time 0(m). 

Let i € [0 : n' — 1], i + w € [n' + l : n]. Then, a cross-substring t(i :i + w) 
is matching, if and only if (i,i+w) hi. Therefore, each point corresponding 
to a cross-substring of t can be reported in constant time. 

An algorithm for the bounded minimal-window subsequence recognition 
problem can be obtained from Algorithm 9 by discarding in every recursive 
step the minimally matching cross-substrings of length exceeding w. 

The overall running time of both the above modifications of Algorithm 9 
is still 0(m log m • n + output). 

8.5 Threshold approximate matching 

The threshold approximate matching problem was introduced by Defini- 
tion 26 in Section 5.2. The problem assumes an alignment score with ar- 
bitrary weights. In the context of threshold approximate matching, a sub- 
string of text t is called a matching substring, if it has alignment score at 
least h against pattern p (alternatively, edit distance at most k), where h 
(respectively, k) is a fixed threshold. 

Approximate pattern matching on compressed text has been studied by 
Karkkainen et al. [75]. For a GC-text of length fh, an uncompressed pattern 
of length n, and an edit distance threshold k, the (suitably generalised) 
algorithm of [75] solves the threshold approximate matching problem in time 
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0(mnk 2 + output). In the special case of LZ78 or LZW compression, the 
running time is reduced to 0(mnk+ output). Bille et al. [21] gave an efficient 
general scheme for adapting an arbitrary threshold approximate matching 
algorithm to work on a GC-text. In particular, when the algorithms by 
Landau and Vishkin [90] and by Cole and Hariharan [37] are each plugged 
into their scheme, the resulting algorithm runs in time 0{n(mm(mk, k 4 + 
m) + log 2 re) + output) = 0(n • min(m/c, k 4 + m) + re 2 ) + output^. In the 
special case of LZ78 or LZW compression, Bille et al. [18] show that the 
running time can be improved to 0(n ■ min(rei&:, k A + m) + output). 

Using the techniques of the previous sections, we now show how the 
threshold approximate matching problem on a GC-text can be solved more 
efficiently, unless the edit distance threshold k is very small. The algorithm 
extends Algorithms 8 and 9, and assumes an arbitrary rational- weighted 
alignment score. As in Algorithm 9, we assume for simplicity the constant- 
time index arithmetic, keeping the index remapping implicit. 

Algorithm 10 (Threshold approximate matching) 

Parameters: character alignment weights w®, w®, w e , assumed to be con- 
stant rationals. 

Input: SLP of length re, generating text string t of length re; plain pattern 
string p of length m; score threshold h. 

Output: locations (or count) of matching substrings in t. 

Description. 

First phase. Recursion on the input SLP generating t. 

To reduce the problem to an unweighted LCS score, we apply the blow- 
up technique described in Section 5.1. Consider the normalised weights 
(5.1), and define the corresponding blown-up strings p, t of length rh = vm, 
re = vn, respectively. 

Recursion base: n = n = 1. The seaweed matrices P?-, P?~, P?- can be 

p,t v,t p,t 

computed by Algorithm 1 (the seaweed algorithm) in v linear sweeps of 
string p. Each of these matrices can be used to query the LCS score H- ^(0, v) 
between p and t. String t is matching, if and only if the corresponding 
weighted alignment score H Pt t(0, 1) is at least h. 

Recursive step: n > n > 1. Let t = t't" be the SLP statement defining string 
t. We have t = t't" for the corresponding blown-up strings. 

As in Algorithm 9, we run a recursive step of Algorithm 8, obtaining the 
seaweed matrices 

pa pB pa 

p,i p,i p,t 

In addition, we obtain the seaweed cross-matrix P?~, -„ by Corollary 2. This 

p,t ,£ 

matrix has exactly rh = vm nonzeros. 
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In contrast to Algorithm 9, it is no longer sufficient to consider just the 

^-maximal nonzeros of P?~, - „; we now have to consider all its rh nonze- 

p,t ,t 

ros. Let us denote the coordinates of these nonzeros, in increasing order 
independently for each coordinate, by 

li <ls < ... <lrh-\ 3\ <J§ < ■•• <im-I ( 8 -l) 

Every nonzero of the seaweed cross-matrix can now be represented as 

for some s,t € (0 : rh). 

The first (respectively, second) of the index sequences (8.1) partitions 
the range [—rh : h'\ (respectively, [h' : rh + h]) into rh + 1 disjoint intervals 
of varying size. Therefore, we have a partitioning of the score cross-matrix 
H®~, 7„ into (m+l) 2 disjoint rectangular blocks of varying dimensions. Con- 
sider a particular block 

H p,i',t" :i i\5v ■■ ~3v] (8-2) 

where 

h = K-§1 *o = L*«+iJ = \j v -i] jt = [j v+ i\ 

for u, v € [0 : m]. For the boundary blocks, where some of the above bounds 
are not defined, we let 

Iq = -rh t~ l = ti Jo = n' jt = m + n 

All the entries in the block (8.2) are ^-dominated by a fixed set of 

nonzeros in P?~, -„. Let d be the number of nonzeros in this set. By Theo- 

p,t',t" J 

rem 11, all the entries within the block (8.2) have identical value: we have 
H p,t',t"^' J ^ = rh — d for all i & [i~ : j G [j~ : J+]. 

We now switch our focus from the blown-up strings p, t' , t" back to 
the original strings p, t' , t" . The partitioning of matrix H?~, r„ induces a 

partitioning of matrix 'H^ t , t ,, into (rh + l) 2 disjoint rectangular blocks of 
varying dimensions. The block corresponding to block (8.2) is 



v 

where 



Kt't'Su \3v -jt] (8-3) 



for u, v G [0 : rh]. For the boundary blocks, where some of the above bounds 
are not defined, we let 

ifii = n ' io =n ' jX = m + n 
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Note that, although the block (8.2) is never empty, the corresponding block 
(8.3) may be empty. This will be the case if either i~ > , or j~ > j+. 

We now show that the maximum entry across the block (8.3) always 
lies in its bottom-left corner. Recall that for the blown-up strings, we have 
^fcp,?' (*'■?') = - d for all i € [t~ : j G \j~ : J+]. By (5.2), for the 
original strings we have 

Kt'A^j) = ^ ■ K - 2^e) + (m + j - i) ■ w e 

where i G [i~ : j G [jj" : jjj"]. Recall that u> e < 0. Therefore, the score is 
maximised when j — i is minimised, and the maximum score is attained by 
the block's bottom-left entry Hp t , t "(i£,jv)- ^ w e < 0) then this maximum 
is strictly greater than all other entries in the block; If w e = 0, then all the 
entries in the block, including the bottom-left entry, have an identical value 

Without loss of generality, we now assume that all the blocks (8.3) are 
non-empty. The bottom-left entries attaining block maxima, taken across 
all the blocks, form an {fh + 1) x (fh + 1) submatrix 

M(u,v) = U^ t , t „(il, j~) 

where u,v G [0 : fh}. Since the matrix T-Lp t , t „ is anti-Monge, its block 
maxima submatrix M. is also anti-Monge. Every entry A4(u, v) can be 
obtained from the corresponding LCS score H^~, ~„(z^+, vj~) in time 0(1). 

We now obtain the set of all row maxima of matrix Ai by the algorithm 
of Lemma 2 (SMAWK algorithm with incremental queries), applied to the 
matrix —Ai in order to convert the maxima into the minima. The set i~h(M) 
of all entries in Ai scoring above the threshold h can now be obtained by a 
local search in the neighbourhoods of the row maxima. 

Let i G [0 : n' — 1], j G [n! + 1 : n]. Then, a cross-substring t(i : j) is 
matching, if and only if H^ t , t »(i,j) > h. Therefore, the set of all matching 
cross-substrings corresponds to the set Th (Hp t , t „ [0 : n' — 1 | n' + 1 : n] ) 
of all entries in the submatrix t , t „[0 : n' — 1 \ n' + 1 : n] scoring above 
the threshold h. This set can now be obtained by a local search in the 
neighbourhood of the bottom-left corner of each block corresponding to an 
element of Th{M). Any kind of filtering described in Section 5.2 can be 
applied to the output. 

(End of recursive step) 

Second phase. As in Algorithm 9, substituting "matching" for "minimally 
matching" . 

Cost analysis. 

First phase. As in Algorithm 8, each seaweed matrix composition runs in 
time O(mlogm) = 0(m log m). The algorithm of Lemma 2 runs in time 
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0(m) = 0(m). Hence, the running time of a recursive step is 0(m log m). 
There are n recursive steps in total, therefore the whole recursion runs in 
time O (m log m ■ n). 

Second phase. As in Algorithm 9, the locations of all matching substrings 
in t can be found in time 0(mn + output). 

Total. The overall running time is 0(m log m ■ n + output). ■ 

Algorithm 10 improves on the algorithm of [75], as long as logm = o(k 2 ) in 
the case of general GC-compression, and log m = o(k) in the case of LZ78 or 
LZW compression. Algorithm 10 also improves on the algorithms of [18, 21], 
as long as mlogm = o(min(mfc, k 4 + m) + nj in the case of general GC- 
compression, and m logm = o(min(m/c, k 4 + m)) in the case of LZ78 or LZW 
compression. 

Example 29 Figure 8.2 shows a simplified snapshot of a recursive step in 
the first phase of Algorithm 10, which is assumed to run on the same input 
as in Figure 8.1. For simplicity, we assume the unweighted LCS score, hence 
v = 1, and the blown-up strings are identical to the original input strings. 
Subfigure 8.2a shows the seaweed cross-matrix P^ t , ( „. Subfigure 8.2b shows 
the corresponding seaweed braid. Both the seaweed cross-matrix and the 
seaweed braid are identical to the ones in Figure 8.1. In Subfigure 8.2a, 
the partitioning into a grid of 6 x 6 blocks is shown by solid lines. The 
bottom-left entry in each block, attaining its maximum score, is shown by 
an asterisk; taken together, all these entries form the 6x6 matrix Ai. In 
Subfigure 8.2b, the corresponding substring boundaries are shown by dotted 
lines (except the ones that coincide with the alignment dag boundaries) . □ 
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(a) Seaweed cross-matrix F^ t , t „ and block partitioning 
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(b) Corresponding seaweed braid 
Figure 8.2: A snapshot of Algorithm 10 (threshold approximate matching) 
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Chapter 9 

Beyond semi-locality 



9.1 Window-local LCS and alignment plots 

So far, we have considered mostly global and semi-local string comparison. 
Our aim now is to approach local string comparison — the type of string 
comparison that is the most important for biological applications, but also 
the most difficult. In this chapter, we consider a version of local string 
comparison that is restricted to a fixed subset of prescribed substrings in one 
of the input strings, comparing them to all substrings in the other string. 

An important special case is where all the prescribed substrings have 
equal length. Given a fixed parameter w, we call a substring of length w a 
w -window in the corresponding string. 

String comparison in windows has a long history. One of its early in- 
stances is dot plots (also known as diagonal plots or dot matrices), intro- 
duced by Gibbs and Mclntyre [59] and by Maizel and Lenk [96]. In addition 
to numerical scores, dot plots provide a convenient visualisation of string 
comparison. In the context of dot plots, processing a pair of windows is 
usually referred to as filtering. The standard filtering method compares 
every w-window of string a against every w-window of string b in terms of 
their Hamming score, i.e. the count of matching character pairs under a rigid 
alignment model, where all characters must be aligned and no gaps are al- 
lowed. A Hamming-filtered dot plot can be computed in time 0(mn) by the 
algorithm of [96, 100]. This algorithm has been implemented in several soft- 
ware packages (see e.g. [120, 109, 36]). A faster suffix-tree based algorithm 
has been proposed and implemented by Krumsiek et al. [85]. Enhancement 
of the dot plot approach have been proposed by Huang and Zhang [68] and 
by Putonti et al. [108]. 

Numerous other methods of local string comparison have been proposed. 
The Smith-Waterman- Gotoh algorithm [119, 62] allows one to obtain the 
highest-scoring pair across all substring pairs in the input strings. It can also 
be generalised to report all substring pairs scoring above a certain threshold. 
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A significant drawback of the Smith- Waterman-Gotoh algorithm is that it 
generally favours long, less precise substring alignments over short, more 
precise ones (as noted e.g. by Arslan et al. [13]). The quality of the alignment 
is also dependent on the scoring scheme: for example, for the LCS score, 
the algorithm only provides the trivial global comparison, so the method 
is generally only useful for weighted alignment scores with sufficiently high 
penalties (negative score weights) for gaps. 

In contrast with the Smith-Waterman-Gotoh algorithm, the dot plot 
method gives the user more flexibility to select the biologically significant 
substring alignments, by providing all the local scores between fixed-size 
windows of the input strings. However, the Hamming scoring scheme used 
by this method within each window pair is less sensitive than even the LCS 
score, and especially than the weighted alignment score used by Smith- 
Waterman-Gotoh. This tradeoff motivates us to combine the best features 
of the two approaches in the following definition. 

Definition 31 Given strings a, b, and a window length w, the window- 
window (respectively, window- sub string) LCS problem asks for the LCS 
score of for every w- window in a against every w- window (respectively, every 
substring) in b. □ 

The window-window LCS problem can be seen as a refinement of the dot 
plot method and a complement to the Smith-Waterman-Gotoh method. 
As in the dot plot method, we compute all window-window comparison 
scores between the input strings. However, instead of the Hamming score, 
our method is based on the LCS score, and is therefore potentially more 
sensitive. The method can be further extended to use weighted alignment 
scores. By analogy with Hamming-filtered dot plots, we call the resulting 
matrix of window-window alignment scores an alignment-filtered dot plot, or 
simply an alignment plot. Recently, we applied the alignment plot method 
to the detection of alignment-conserved regions in DNA [106] . 

The solution of the window-substring LCS problem can be represented in 
space 0(mn log n) by the data structure of Theorem 1, built on the string- 
substring seaweed matrix for each window of a against b. An individual 
window-substring LCS score query can be performed on this data structure 
by Theorem 1 in time 0(log 2 n). The full explicit solution to the window- 
window LCS problem can be obtained in time O(mn), by performing a 
diagonal batch query directly on each of the string-substring seaweed matri- 
ces. Thus, string-substring seaweed matrices provide a unified solution for 
both the window-substring and the window-window LCS problems. 

A naive algorithm for the window-window LCS problem runs in time 
0(mnw 2 ), and for the window-substring LCS problem runs in time 0(mn 3 w). 
The running time for both problems can be improved by applying Algo- 
rithm 1 (the seaweed algorithm) independently to each window of string 
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(a) Canonical substrings and windows 




(b) Window decomposition forest 



Figure 9.1: An execution of Algorithm 11 (window-substring LCS) 

a against whole string b. The resulting algorithm runs in time 0(mnw). 
If window length w is sufficiently large, the running time can be improved 
slightly by using Algorithm 2 (the micro-block seaweed algorithm). 

We now give a new algorithm for the window-substring LCS problem 
(and therefore also for the window- window LCS problem as a special case). 
Our algorithm provides a further substantial improvement on the above 
approach, and matches the asymptotic running time of both the Hamming- 
scored dot plot and the Smith- Waterman-Gotoh methods. 

Algorithm 11 (Window-substring LCS) 

Input: strings a, b of length m, n, respectively; window length w. 

Output: nonzeros of the string-substring seaweed matrix for every w- window 
of string a against full string b. 

Description. Without loss of generality, we assume that m is a power of 
2. Let s be an arbitrary power of 2, 1 < s < m. We call a substring a(i : j) 
canonical, if j — i = s, and i, j are both multiples of s. In particular, both 
the whole string a, and every one of its individual characters, are canonical 
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substrings. Every substring of a can be decomposed into a concatenation of 
O(logm) canonical substrings. 

In the following, by processing a substring a' of a, we mean computing 
the string-substring seaweed matrix P^, fe . 

First phase. We process all canonical substrings of a by the following recur- 
sive procedure. 

Recursion base: m = 1. Matrix P a b is computed by a linear sweep of string 
b. 

Recursive step: m > 1. We have a = a'a" , where a' , a" are canonical sub- 
strings of length m/2. We call the first phase procedure recursively on each 
of a', a" against b, obtaining matrices PjjJ b , P^„ b . Then, we obtain the 
matrix = P* fi □ J» >6 by Corollary 1. 

(End of recursive step) 

Second phase. Let s be an arbitrary power of 2, 1 < s < m. We define 
(s,t) -window to be a substring a(i,j), such that j — i = st, and i and j are 
both multiples of s. Intuitively, t is the maximum number of windows that 
overlap at any single character in the string a. Note that a (l,t)-window 
is the same as a i-window, and an (s, l)-window is the same as a canonical 
substing. Given parameters s, t, we solve the problem of processing all 
(s, t)-windows by the following recursive procedure. 

Recursion base: t = 1. For any s, all (s, l)-windows are canonical substrings, 
therefore they have already been processed in the first phase. 

Recursive step: t > 1. If t is even, then we call the second phase recursively 
to process all (s,t — l)-windows. We then process all (s, t)-windows a(i : j) 
as follows: 

pH _ J a{i:j-s),b a{j-s:j),b /g 1\ 

a{i:j),b \ pa mpo V ' j 

V. a(i:i+s),b a(t+s:jj,b 

where the choice between the two cases is made arbitrarily. 

If t is odd, then we call the second phase recursively to process all 
(2s, ^)- windows. We then process all (s, t)-windows a(i : j) by (9.1), 
where the first case is followed for \ even and i odd, and the second case is 

s s ' 

followed for - odd and - even. 

s s 

The implicit distance products in (9.1) are computed by Corollary 1. 
In each case, the product is between two string-substring seaweed matrices: 
one for substring a(i : j — s) or a(i + s : j), already processed by the recursive 
call, and the other for a canonical substring a(j — s : j) or a(i : i + s). 

(End of recursive step) 
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Cost analysis. 

First phase. In every subsequent level down the recursion, the number of 
matrix multiplications doubles. The running time for each matrix multipli- 
cation decreases by a factor o(l). Therefore, the running time is dominated 
by the bottom level of the recursion, where we have m/2 matrix multiplica- 
tions, each running in time 0(n). The running time of the whole phase is 
m/2 • 0{n) = 0(mn). 

Second phase. In every two consecutive levels down the recursion, the num- 
ber of matrix multiplications is reduced by at least a factor of 2. The running 
time for each matrix multiplication increases by a factor o(l). Therefore, 
the running time is dominated by the top level of the recursion, where we 
have 0(m) matrix multiplications, each running in time 0{n). Therefore, 
the running time is O(m) ■ 0(n) = 0(mn). 

Total. The running time for both the first and the second phase, and there- 
fore the overall running time, is 0{mn). ■ 

Note that the asymptotic running time of Algorithm 11 is independent of 
the window length w. 

Example 30 Figure 9.1 shows an execution of Algorithm 11 on string a 
of length 16 with window size 7, against string b of arbitrary length. Sub- 
figure 9.1a shows the canonical substrings of a of lengths 1, 2, 4, 8, 16 in 
black, and windows of length 7 in blue. For each window, the figure shows 
its decomposition into canonical substrings. For one of the windows, high- 
lighed in thick red, the corresponding area is outlined in the alignment dag. 
Subfigure 9.1b represents substrings of a by points in the plane, and the 
decompositions into canonical substrings by a forest of trees. Each win- 
dow a(i : j) corresponds to a leaf of a decomposition tree. The canonical 
substrings in a decomposition of the window correspond to the edges on the 
path from the corresponding leaf to the root of the tree. The leaves, internal 
nodes and roots of decomposition trees are shown respectively by diamond, 
circle and square bullets. □ 

9.2 Quasi-local LCS 

We now consider an arbitrary set of prescribed substrings of various lengths 
in string a. We assume that all the prescribed substrings are non-empty, 
and denote their number by k, m < k < (™). 

Definition 32 Given strings a, 6, the quasi-local LCS problem asks for the 
LCS score of every prescribed substring in a against every substring in b. □ 
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The quasi-local LCS problem includes as special cases the semi-local, window- 
window, window-substring and fully-local LCS problems, as well as length- 
constrained local alignment considered by Arslan and Egecioglu [12] . The so- 
lution of the quasi-local LCS problem can be represented in space 0(kn log n) 
by the data structure of Theorem 1, built on the string-substring seaweed 
matrix for each prescribed substring of a against b. An individual quasi-local 
LCS score query can be performed on this data structure in time (9(log 2 n). 

A naive algorithm for the quasi-local LCS problem runs in time 0(kmn 3 ). 
This running time can be improved by applying Algorithm 1 (the seaweed 
algorithm) independently to each prescribed substring a against whole string 
b. The resulting algorithm runs in time O(kmn). If all the prescribed sub- 
strings are sufficiently long, then the running time can be improved slightly 
by using Algorithm 2 (the micro-block seaweed algorithm). 

We now give an algorithm for the quasi-local LCS problem that provides 
a further improvement on the above approach. 

Algorithm 12 (Quasi-local LCS) 

Input: strings a, b of length m, n, respectively; a set of k endpoint index 
pairs for the prescribed substrings in a. 
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Output: nonzeros of the string-substring seaweed matrix for every pre- 
scribed substring of string a against full string b. 

Description. The algorithm structure is similar to the one of Algorithm 11. 
First phase. As in Algorithm 11. 

Second phase. We process all prescribed substrings of a by the following 
recursive procedure. Let s be a parameter, assumed to be a power of 2. 
Initially, we set s = 1. At every level of recursion, the endpoint indices of 
the prescribed substrings are multiples of s. 

Recursion base: the set of prescribed substrings is empty. In this case, the 
problem is trivial. 

Recursive step: there are some prescribed substrings. First, we remove from 
consideration all canonical prescribed substrings, since they have already 
been processed in the first phase. We then call the second phase pro- 
cedure recursively with the parameter 2s, and the following set of pre- 
scribed substrings. For each currently prescribed non-canonical substring 
a{i,j), the corresponding new prescribed substring in the recursive call is 
a (^ s l"^l : ^ s l_iJ)' umess this substring is empty. Informally, we round i 
and j to the nearest multiple of 2s; index i is rounded up and index j down. 
Note that different currently prescribed substrings may correspond to the 
same new prescribed substring in the recursive call. 

The recursive call results in the processing of all the prescribed substrings 
a{i,j) where i, j are multiples of 2s; in other words, where - and - are both 
even. We then process all remaining prescribed substrings a(i : j) as follows: 



pB 

a{i:j), 



I s1 ,.. ..BP ,. l - even, i odd 

a(i:]-s),b a{j-s:j),b s ' s 

%., +s) , b ^% +s:j) , b l odd, j even 

%-, +S ), b Q % + s-,-s), b H %-s-,), b 'I Odd, j Odd 



The implicit distance products are computed by Corollary 1. In each case, 
the product is between two or three string-substring seaweed matrices: one 
for substring a(i : j — s), a(i+s : j) or a{i+s : j—s), already processed by the 
recursive call; and the other one or two for canonical substrings a(j — s : j) 
and/or a(i : i + s). 

(End of recursive step) 

Cost analysis. 

First phase. As in Algorithm 11, the total running time of this phase is 
0{mn). 

Second phase. In every level of the recursion, the number of matrix multipli- 
cations is at most O(k). The running time for each matrix multiplication is 
at most 0(n log m). The recursion has logm levels. Therefore, the running 
time is dominated by the top level of the recursion, where we have 0{m) 
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matrix multiplications, each running in time 0(n). Therefore, the running 
time of the whole phase is 0(klogm ■ nlogm) = 0(kn log 2 m). 

For values of k close to the fully-local case k = (™) = 0(m 2 ), a sharper 
analysis is possible. In this case, the running time of the whole phase is 
0(m 2 n). 

Total. The overall running time is dominated by the second phase, and is 
therefore 0(kn log 2 m). For values of k close to (™), the running time is 
0(m 2 n). ■ 

Note that in the fully- local case, the same asymptotic time can be ob- 
tained by running m independent instances of the seaweed algorithm (Al- 
gorithm 1), each instance computing the implicit highest-score matrices in- 
crementally for 0(n) different substrings of a. 

Example 31 Figure 9.2 shows an execution of Algorithm 11 on string a of 
length 16, with 16 prescribed substrings of various sizes, against string b of 
arbitrary length. Conventions are the same as in Figure 9.1. □ 

9.3 Sparse spliced alignment 

Assembling a gene from candidate exons is an important problem in compu- 
tational biology. Several alternative approaches to this problem have been 
developed over time. One of such approaches is spliced alignment by Gelfand 
et al. [57] (see also [63]), which scores different candidate exon chains within 
a DNA sequence by comparing them to a known related gene sequence. In 
this method, the two sequences are modelled respectively by strings a, b of 
lengths m, n respectively. A subset of substrings in string a are marked 
as candidate exons. The comparison between sequences is made by string 
alignment. The algorithm for spliced alignment given in [57] runs in time 
0(m?n). 

In general, the number of candidate exons k may be as high as (™) = 
0(m 2 ). The method of sparse spliced alignment makes a realistic assump- 
tion that, prior to the assembly, the set of candidate exons undergoes some 
filtering, after which only a small fraction of candidate exons remains. Kent 
et al. [78] give an algorithm for sparse spliced alignment that, in the spe- 
cial case k = 0(m), runs in time 0(m L5 n). By a direct application of the 
quasi-local LCS problem (Section 9.2), the running time can be reduced 
to 0(mn log 2 m). Sakai [116] gave an improved algorithm, running in time 
0{mn logn). 

For higher values of k, all the described algorithms provide a smooth 
transition in running time to the dense case k = (™). In this case, the 
algorithms' running time 0(m 2 n) is asymptotically equal to the algorithm 
of [57]. 
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We now describe an algorithm for sparse spliced alignment, based on 
the approach of [116]. We keep the notation and terminology of the pre- 
vious sections; in particular, candidate exons are represented by prescribed 
substrings of string a. We say that substring a(i' : /) precedes substring 
a(i" : j"), if f < i" . A precedence chain of substrings is a chain in the 
partial order of substring precedence. We identify every precedence chain 
with the string obtained by concatenating all its component substrings in 
the order of precedence. 

The algorithm uses a generalisation of the standard network alignment 
method, equivalent to the one used by [78]. For simplicity, we describe the 
algorithm under LCS score; using the blow-up technique of Section 5.1, the 
algorithm can be generalised to an arbitrary alignment score with rational 
weights. 

Algorithm 13 (Sparse spliced alignment) 

Input: strings a, b of length m, n, respectively; a set of k endpoint index 
pairs for the prescribed substrings in a. 

Output: the precedence chain of prescribed substrings in a, giving the high- 
est LCS score against string b. 

Description. The algorithm runs in two phases. 

First phase. As in Algorithm 11. 

Second phase. The problem is now solved by dynamic programming as fol- 
lows. Let Uj(s) denote the highest LCS score across all precedence chains of 
prescribed substrings in prefix string a] j, taken against prefix string b] s. 
With each j € [0 : n], we associate the integer vector Uj = Uj(*). We ini- 
tialise uq as the zero vector. We then compute the vectors Uj, j £ [1 : n], in 
the order of increasing j. Let ao = a(io : j), a\ = a(i\ : j) , . . . , at = (it '■ j) 
be all the prescribed substrings of a terminating at index j. We have 

Uj = uj-! © (u io H^ b ) © (u h H^ b ) © ... © (u it H*J (9.2) 

for all j G [1 : n\. 

The matrices b , b , . . . , b do not need to be computed ex- 
plicitly. Instead, it is straightforward to compute each of the vector-matrix 
distance products in (9.2) by up to logm instances of implicit vector-matrix 
distance product, using the decomposition of each of ao, a±, . . . , at into up to 
logm canonical substrings, along with the corresponding seaweed matrices 
obtained in the first phase. 

The solution score is now given by the value u m (n). The solution prece- 
dence chain of prescribed substrings can be obtained by tracing the dynamic 
programming sequence backwards from vector u m to vector u . 

Cost analysis. 
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First phase. As in Algorithm 11, the total running time of this phase is 
0{mn). 

Second phase. For each of k prescribed substrings of a, we execute up to 
log m instances of implicit matrix- vector distance multiplication. Every such 
instance runs in time 0(n) by Theorem 6. Therefore, the total running time 
of this phase is 0(kn log m). 

Total. The overall running time of the algorithm is dominated by the second 
phase, and is therefore 0(kn log m). ■ 

Similarly to Algorithm 12, a sharper analysis for k ps (™) leads to a 
smooth transition to the running time 0(m 2 n) in the dense case k = (™), 
which is asymptotically equal to the running time of the dense spliced algo- 
rithm of [57]. 
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Chapter 10 

Conclusions 



We have surveyed a number of existing and new algorithmic techniques and 
applications related to semi-local string comparison. Our approach unifies 
a substantial number of previously unrelated problems and techniques, and 
in many cases allows us to match or improve existing algorithms. It is likely 
that further development of this approach will give it even more scope and 
power. 

A number of questions related to the semi-local string comparison frame- 
work remain open. In particular, it is not yet clear whether the framework 
can be extended to arbitrary real costs, or to sequence alignment with non- 
linear gap penalties. 

In summary, semi-local string comparison turns out to be a useful algo- 
rithmic plug-in, which unifies, and often improves on, a number of previous 
approaches to various substring- and subsequence-related problems. 
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offset indentity, 9 
score 

prefix-suffix, 39 

semi-local, 35 

string-substring, 39 

substring-string, 39 

suffix-prefix, 39 
seaweed 

prefix-suffix, 39 

semi-local, 37 

string-substring, 39 

substring-string, 39 

suffix-prefix, 39 
seaweed cross-, 40 
simple, 8 

subpermutation, 9 
subunit-anti-Monge, 10 
subunit-Monge, 10 
totally monotone, 8 
unit-anti-Monge, 10 
unit-Monge, 10 

0-annihilator, 16 

0-identity, 16 

0-monoid, 14 

problem 

approximate matching, 60 

complete, 60 

output filtering, 61 

threshold, 61, 88 
block-incremental LCS, 53 
circle graph maximum clique, 
common-substring LCS, 54 

semi-local, 54 
cyclic LCS, 55 

between permutations, 71 
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episode matching, 61 
fully-incremental LCS, 52 
incremental LCS, 52 
linear graph maximum common 

pattern, 78 
longest /c-increasing subsequence, 

73 

longest /c-modal subsequence, 73 
longest y-avoiding subsequence, 
72 

longest common subsequence (LCS), 
31 

longest increasing subsequence (LIS), 
69 

longest repeating subsequence, 55 
periodic string-substring LCS, 64 
quasi-local LCS, 98 
semi-local LCS, 32 

three-way, 33, 82 
subsequence matching, 32 
subsequence recognition 

global, 32 

local, 32, 61, 84 
tandem alignment, 67 
tandem LCS, 67 
window-substring LCS, 95 
window-window LCS, 95 



subsequence, 31 
filtered, 69 

substring, 31 
canonical, 96 
cross-, 40 
prescribed, 94 
window, 94 

suffix, 31 



score 

Hamming, 94 
seaweed, 17 

braid, 17 

monoid, 18 
spliced alignment, 101 

sparse, 101 
straight-line program (SLP), 80 

statement, 80 
string 

grammar-compressed (GC), 80 
periodic, 64 
permutation, 69 

identity, 69 
prefix, 31 
reverse, 69 
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